Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial draft of nftables support #883

Merged
merged 1 commit into from
Jan 10, 2024

Conversation

mheon
Copy link
Member

@mheon mheon commented Jan 4, 2024

This implementation lacks isolation support at present (that'll be a followon PR) and has only very minimal testing at the moment (including absolutely 0 testing for IPv6), but does handle all core tasks including forwarding and all types of port forwarding.

Fixes #816

Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@mheon
Copy link
Member Author

mheon commented Jan 5, 2024

Stripping WIP. Passes testing I've thrown at it. Now needs integration tests.

@mheon mheon force-pushed the nftables branch 3 times, most recently from 601c63a to a9ff1ce Compare January 8, 2024 15:29
@mheon
Copy link
Member Author

mheon commented Jan 8, 2024

Integration tests added.

Caveat: I don't have local access to IPv6 here, so I cannot test IPv6 forwarding and port forwarding.

@mheon mheon force-pushed the nftables branch 2 times, most recently from b6d9c7d to c19a317 Compare January 8, 2024 16:37
@mheon
Copy link
Member Author

mheon commented Jan 8, 2024

@baude @Luap99 I'm still working on the tests a bit but the code is ready for review

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial look, not a full review.

FYI this seems to bloat the binary about 1.5 MB to about 21.5 MB total, not that this a blocker or anything I was just curious. I think we could use some time to debloat this at some point.

stmt::Statement::Match(stmt::Match {
left: expr::Expression::BinaryOperation(expr::BinaryOperation::AND(
Box::new(expr::Expression::Named(expr::NamedExpression::Meta(
expr::Meta {
key: expr::MetaKey::Mark,
},
))),
Box::new(expr::Expression::Number(MASK)),
)),
right: expr::Expression::Number(MASK),
op: stmt::Operator::EQ,
}),
stmt::Statement::Masquerade(None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WTF is this syntax, this looks so hard to use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is atrocious. But somehow better to use than the string API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mheon @Luap99

Hi, I'm the nftables-rs maintainer.

I agree with your concerns regarding the usage of the nftables-rs crate. I'm convinced that using the nftables JSON API is the way to go considering portability and maintainability.

Consider this usage example from nftnl-rs (native binding to libnftables): https://github.com/mullvad/nftnl-rs/blob/main/nftnl/examples/add-rules.rs#L147-L153

Likewise we could add this usability layer to nftables-rs through macros. WDYT?

I'm very interested in your opinion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using macros that add better syntax sounds great, having something closer to the real nft syntax would be ideal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a massive fan of this idea. The existing syntax can be very verbose, and I ended up with a lot of convenience functions to remove that boilerplate; adding macros on the library side would make using the library much more convenient.

And also strongly agree with the JSON API bit. My original attempt at this was using the Mullvad library, but it did not support everything we needed, and the description of that library as being based off reverse-engineered usage of libnftnl inside the nft binary was... concerning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback.

@mheon: I went through the very same process with nftnl-rs, which didn't have the features I needed for another project. This is the reason I created nftables-rs.

I'll look into implementing nft-like syntax through macros.

}

/// Convert a subnet into a chain name.
fn get_subnet_chain_name(subnet: IpNet, dnat: bool) -> String {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of using the subnet as chain name. What is wrong with the existing hash the used in iptables?
We could also just use the network ID I think?
Is there a reason to have one chain per subnet vs chain per network? One per subnet looks incorrect to me, because nft supports both ipv4 and ipv6 at the same time I would expect one chain that matches both the v4 and v6 subnet in there. More chains should also work but doesn't this just make things more complicated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like using subnets because it reads easier in nftables output (very obvious what each chain does), is completely unambiguous (no chance of a hash collision).

I could definitely combine v4 and v6 subnets into a single chain, but I think separating them is a bit more logical. It doesn't really need more rules, either (we need the same number of rules to actually forward traffic, and separate rules to forward traffic to the chain for v4 and v6 subnets).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I need to dump me a nft list ruleset output to see what it actually looks like to better judge what I prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I really do not like the name like that, my main issue is that with the default podman network create we juts iterate trough the subnets to assign them, so it is normal to have 10.89.0.0/24, 10.89.1.0/24, 10.89.2.0/24 and so on. This makes it hard for me to follow the chains as only a single digit changes in the name. I have no problem with having the subnet as part of the name but I would prefer if lets say 12 chars of the network ID are added as suffix to make them more distinct to my eyes.

I know we don't really have to look at this often but when we get bug reports we have to and I think a more distinct name will make it easier to find problems (e.g. missing chains/rules)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Only 8 characters though (I don't want things to get ridiculously long with IPv6 addresses)

src/firewall/nft.rs Show resolved Hide resolved
@mheon mheon force-pushed the nftables branch 3 times, most recently from 11e4f4c to e9ae31c Compare January 8, 2024 20:05
@mheon
Copy link
Member Author

mheon commented Jan 8, 2024

@Luap99 One last failing test, and it's a firewalld reload one. Any idea what might be going on?

@Luap99
Copy link
Member

Luap99 commented Jan 9, 2024

Looks like you do not call into firewalld again to set the subnet trusted. (Do we even have to do that when using nftables?)
The VM is using f39 which has firewalld 2.0 which no longer flushes our custom fw rules by default, looking at your code you check that the nft rule exists then skip (continue) the subnet loop so the firewalld call is never done thus it is not trusted.

But because nftables - port forwarding ipv4 - tcp with firewalld reload passes I wonder if we even need to be trusted with nftables?

@mheon
Copy link
Member Author

mheon commented Jan 9, 2024

The firewalld folks said it was safer to do so. I'll move the firewalld make-trusted call to the top of the loop.

@mheon
Copy link
Member Author

mheon commented Jan 9, 2024

Alright, that should get tests passing. This is ready for review.

@mheon mheon force-pushed the nftables branch 3 times, most recently from 5753bdf to 77f8df8 Compare January 9, 2024 15:57
@Luap99
Copy link
Member

Luap99 commented Jan 10, 2024

I noticed these errors in the journal when testing this on my host:

Jan 10 16:44:31 pholzing-fedora /usr/bin/podman[42596]: time="2024-01-10T16:44:31+01:00" level=info msg="netavark: [DEBUG netavark::firewall::firewalld] Removing firewalld rules for IPs 10.88.0.0/16\n"
Jan 10 16:44:31 pholzing-fedora firewalld[2202]: ERROR: UNKNOWN_ERROR: rule ref count bug, missing ref count: rule_key '{"%%POLICY_SORT_KEY%%": [0, 1, "trusted", "10.88.0.0/16", "", 0, 1, "trusted", "10.88.0.0/16", "", 0, -1], "chain": "filter_FORWARD_POLICIES", "expr": [{"match": {"left": {"payload": {"field": "saddr", "protocol": "ip"}}, "op": "==", "right": {"prefix": {"addr": "10.88.0.0", "len": 16}}}}, {"match": {"left": {"payload": {"field": "daddr", "protocol": "ip"}}, "op": "==", "right": {"prefix": {"addr": "10.88.0.0", "len": 16}}}}, {"jump": {"target": "filter_FWD_policy_netavark_portfwd"}}], "family": "inet", "table": "firewalld"}'
Jan 10 16:44:31 pholzing-fedora firewalld[2202]: ERROR: COMMAND_FAILED: UNKNOWN_ERROR: rule ref count bug, missing ref count: rule_key '{"%%POLICY_SORT_KEY%%": [0, 1, "trusted", "10.88.0.0/16", "", 0, 1, "trusted", "10.88.0.0/16", "", 0, -1], "chain": "filter_FORWARD_POLICIES", "expr": [{"match": {"left": {"payload": {"field": "saddr", "protocol": "ip"}}, "op": "==", "right": {"prefix": {"addr": "10.88.0.0", "len": 16}}}}, {"match": {"left": {"payload": {"field": "daddr", "protocol": "ip"}}, "op": "==", "right": {"prefix": {"addr": "10.88.0.0", "len": 16}}}}, {"jump": {"target": "filter_FWD_policy_netavark_portfwd"}}], "family": "inet", "table": "firewalld"}'
Jan 10 16:44:31 pholzing-fedora /usr/bin/podman[42596]: time="2024-01-10T16:44:31+01:00" level=info msg="netavark: [WARN  netavark::firewall::firewalld] Error removing subnet 10.88.0.0/16 from firewalld trusted zone: org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: UNKNOWN_ERROR: rule ref count bug, missing ref count: rule_key '{\"%%POLICY_SORT_KEY%%\": [0, 1, \"trusted\", \"10.88.0.0/16\", \"\", 0, 1, \"trusted\", \"10.88.0.0/16\", \"\", 0, -1], \"chain\": \"filter_FORWARD_POLICIES\", \"expr\": [{\"match\": {\"left\": {\"payload\": {\"field\": \"saddr\", \"protocol\": \"ip\"}}, \"op\": \"==\", \"right\": {\"prefix\": {\"addr\": \"10.88.0.0\", \"len\": 16}}}}, {\"match\": {\"left\": {\"payload\": {\"field\": \"daddr\", \"protocol\": \"ip\"}}, \"op\": \"==\", \"right\": {\"prefix\": {\"addr\": \"10.88.0.0\", \"len\": 16}}}}, {\"jump\": {\"target\": \"filter_FWD_policy_netavark_portfwd\"}}], \"family\": \"inet\", \"table\": \"firewalld\"}'\n"

I don't think they are related to this PR but I have to retest with the main/fedora version. Do you see this as well?

@mheon
Copy link
Member Author

mheon commented Jan 10, 2024

Nope. Don't see anything like that in journal (F38, fully updated, Podman from main, Netavark from my branch)

@Luap99
Copy link
Member

Luap99 commented Jan 10, 2024

f39 ships with firewalld 2.0 and f38 with 1.X I think? So likely a firewalld change then. Anyway we can debug this if you get around to the firewalld stuff, doesn't seem related to this at all as I can reproduce using the fedora version with iptables.

@Luap99
Copy link
Member

Luap99 commented Jan 10, 2024

One question though:

	chain NETAVARK-HOSTPORT-DNAT {
		tcp dport 80 jump nv_10_88_0_0_nm16_dnat
		tcp dport 81 jump nv_10_88_0_0_nm16_dnat
		tcp dport 82 jump nv_10_88_0_0_nm16_dnat
	}

	chain NETAVARK-HOSTPORT-SETMARK {
		meta mark set meta mark | 0x00002000
	}

	chain nv_10_88_0_0_nm16_dnat {
		ip saddr 10.88.0.0/16 tcp dport 80 jump NETAVARK-HOSTPORT-SETMARK
		ip saddr 127.0.0.1 tcp dport 80 jump NETAVARK-HOSTPORT-SETMARK
		tcp dport 80 dnat ip to 10.88.0.6:80
		ip saddr 10.88.0.0/16 tcp dport 81 jump NETAVARK-HOSTPORT-SETMARK
		ip saddr 127.0.0.1 tcp dport 81 jump NETAVARK-HOSTPORT-SETMARK
		tcp dport 81 dnat ip to 10.88.0.7:80
		ip saddr 10.88.0.0/16 tcp dport 82 jump NETAVARK-HOSTPORT-SETMARK
		ip saddr 127.0.0.1 tcp dport 82 jump NETAVARK-HOSTPORT-SETMARK
		tcp dport 82 dnat ip to 10.88.0.8:80
	}

What is the point of the dport match in NETAVARK-HOSTPORT-DNAT? The ..._dnat chain has to match the dport again anyway so just a direct jump should be easier?

@mheon
Copy link
Member Author

mheon commented Jan 10, 2024

Comments addressed (added network ID to subnet chain names). Should be ready for merge.

I'm on the firewalld stuff now, so clearly I need to do a dist-upgrade to f39 and see how badly we're broken (and what fixing it will mean - I really don't want to have to conventionalize on firewalld version)

@mheon
Copy link
Member Author

mheon commented Jan 10, 2024

I think that jump is mostly to prevent the NETAVARK-HOSTPORT-DNAT chain small. We could eliminate the nv_10_88_0_0_nm16_dnat chain entirely and stuff everything into NETAVARK-HOSTPORT-DNAT but that would get long and messy (especially if one container decides to do something like a 1000-port range forward). And so long as we have per-network DNAT chains, might as well only jump to them if we actually have to.

@mheon mheon force-pushed the nftables branch 2 times, most recently from be3f504 to 3a067df Compare January 10, 2024 17:18
@Luap99
Copy link
Member

Luap99 commented Jan 10, 2024

Ok I assumed a port range would actually only result is one rule not in one per port but seems like such feature (https://bugzilla.netfilter.org/show_bug.cgi?id=1501) is not yet implement or very new, not sure what the status is (finding out if patches on ML are merged is always such a chore)

@mheon
Copy link
Member Author

mheon commented Jan 10, 2024

If it is merged, it's not in the JSON schema yet, so we can't make use of it. It's unfortunate because it's the only place ranges don't work.

RE: Firewalld - The release notes say 2.0 has no breaking changes. I'm looking deeper.

@@ -35,6 +37,8 @@ pub struct TearDownNetwork {
pub struct PortForwardConfigGeneric<Ports, IpAddresses> {
/// id of container
pub container_id: String,
/// id of the network
pub id: String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I just remembered the json parser is very strict and fields are required by default so the netavark firewalld reload service would fail to parse the config files created with the old version, not a huge deal but I think just adding #[serde(default)] to both fields might solve it. The string would be empty but given that iptables code does not use it that would be fine.

Also for consistency I think network_id is a better name

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@mheon mheon force-pushed the nftables branch 2 times, most recently from c8c548a to 2a722d4 Compare January 10, 2024 17:40
Adds an nftables firewall backend and tests for said backend.
Implements basic forwarding, port forwarding, and teardown for
all relevant rules. Heavily based on our existing iptables driver
but with a number of improvements (we live in a dedicated table,
so this should play much more nicely with other tools using the
firewall; IPv4 and IPv6 share a table and almost all code; and
rule structure is a bit simpler because we do have our own table
and don't have to worry about cluttering up the FORWARD chain,
we'll the the only ones using it.

This implementation presently does not support isolation; that
will be added in a followon.

Fixes containers#816

Signed-off-by: Matthew Heon <[email protected]>
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming tests pass, nice work. I checked the simple bridge example with #884 and see a 4-5x speedup with nftables compared to iptables.

/lgtm
/hold

Copy link
Contributor

openshift-ci bot commented Jan 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mheon
Copy link
Member Author

mheon commented Jan 10, 2024

/hold cancel

@openshift-merge-bot openshift-merge-bot bot merged commit 5366e8f into containers:main Jan 10, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nftables support
3 participants