
Investigate how two dpservice instances can be run in parallel #643

Open
guvenc opened this issue Feb 4, 2025 · 1 comment
Labels: enhancement (New feature or request)

guvenc (Collaborator) commented Feb 4, 2025

Investigate whether it is possible to run two identical dpservice instances in parallel without disturbing each other.
This is one of the necessary steps toward updating a single dpservice instance without downtime.

@guvenc guvenc added the enhancement New feature or request label Feb 4, 2025
@guvenc guvenc moved this to In Progress in Networking Feb 4, 2025
@guvenc guvenc added this to Networking Feb 4, 2025
PlagueCZ (Contributor) commented Feb 7, 2025

Known issues to resolve

  • Need to run two instances in parallel
  • Need packets to be processed by only one of them (maybe process new flows in the new one, old flows in the old one?)
  • Need to orchestrate them to the same state
  • Need to pass control from one to the other (and then stop the original one)

Two instances of dpservice in parallel + orchestration

The first idea was to somehow use DPDK secondary process(es), but this is not usable because there can only ever be one primary process, so there would be no way to promote the new dpservice back to "primary" status.

What does work, though, is to run two primary processes with separate hugepage memory (via the EAL --file-prefix argument).

Issues

  • [SOLVED] all secondary processes also need to use this argument to work properly (dpservice-dump, dpservice-inspect, ...)
  • [SOLVED] metalnet needs to connect to a different gRPC port (already capable of)
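To make the two-primary-process setup above concrete, here is a hedged sketch of how the invocations could look. The binary names, flag placement, and port numbers are assumptions for illustration; only --file-prefix itself is the standard DPDK EAL argument, and the exact dpservice command line may differ.

```shell
# Hypothetical invocations -- exact dpservice flags may differ.
# Old instance: its own EAL hugepage file prefix and gRPC port.
dpservice-bin --file-prefix=dps_old -- --grpc-port=1337 &

# New instance: a distinct --file-prefix lets it also start as a
# DPDK *primary* process, plus a different gRPC port for metalnet.
dpservice-bin --file-prefix=dps_new -- --grpc-port=1338 &

# Secondary tools must pass the matching prefix to attach to the
# right instance's shared memory:
dpservice-dump --file-prefix=dps_new
```

The key point is that --file-prefix changes the runtime directory and hugepage file names DPDK uses, so the two primaries never touch each other's shared memory.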

Passing control

There needs to be a way of stopping dpservice from processing packets. This way the new one can come up in an "idle" state, get reconciled and then simply start processing, while the other stops.

[SOLVED] I simply added a gRPC call to start/stop processing, and dpservice can start with processing stopped by default. This is done in rx_node because the rest of the graph needs to keep running, since gRPC itself is processed by the graph.

Currently this is just a hack with a special command-line argument and reuse of another gRPC command, but that's trivial to finish.

Passing state

Orchestration is not enough, we also need to pass connection tracking, etc.

[POSSIBLE] In principle this is easy. "Just" share some memory or open a connection between processes and transfer data. It will be hard to do properly, but not really complex.

Packet processing

We need to only process packets by one instance of dpservice. This is prevented by mlx5 that is duplicating packets when two instances are running. Unfortunately the duplication happens after rte_eth_dev_start() which is essential for orchestration.

[POSSIBLE] In DPDK, there is a single place where an rte_flow is installed and causes this packet duplication. While not trivial, this can be patched so that we can remove the rule and install it "later" (connected to the above "passing control"). The place in DPDK is mlx5_trigger.c:1284 calling mlx5_traffic_enable().

Currently, though, I am skipping this step, as traffic seems to actually survive packet duplication (my guess is that the switches/kernels drop the duplicates).

Next steps

As I am intentionally not finishing any step (to minimize work during research), the last big thing is to test in a proper traffic situation in OSC and verify how big a problem packet duplication is.

There is also a small problem with stopping the old instance, which causes metalnet to unsubscribe everything in the new instance (my hypothesis, no details yet); this should be relatively trivial to solve.

There is also the fact that the two dpservice instances share the CPU, and in polling mode this halves the processing power of both. But this is again easily solvable by slowing down the "idle" instance in the graph loop.
