dpif-netdev: Forwarding optimization for flows with a simple match.

There are cases where users might want simple forwarding or drop rules
for all packets received from a specific port, e.g ::

  "in_port=1,actions=2"
  "in_port=2,actions=IN_PORT"
  "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
  "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"

There are also cases where complex OpenFlow rules can be simplified
down to datapath flows with very simple match criteria.

In theory, for very simple forwarding, OVS doesn't need to parse
packets at all in order to follow these rules.  "Simple match" lookup
optimization is intended to speed up packet forwarding in these cases.

Design:

Due to various implementation constraints, the userspace datapath
always has the following flow fields in exact match (i.e. it's required
to match at least these fields of a packet even if the OF rule doesn't
need that):

  - recirc_id
  - in_port
  - packet_type
  - dl_type
  - vlan_tci (CFI + VID) - in most cases
  - nw_frag - for ip packets

Not all of these fields are related to the packet itself.  We already
know the current 'recirc_id' and the 'in_port' before starting the
packet processing.  It also seems safe to assume that we're working
with Ethernet packets.  So, for the simple OF rule we need to match
only on 'dl_type', 'vlan_tci' and 'nw_frag'.

'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
combined in a single 64-bit integer (mark) that can be used as a hash
in a hash map.  We are using only VID and CFI from the 'vlan_tci';
flows that need to match on PCP will not qualify for the optimization.
The workaround for matching on non-existence of a vlan is updated to
match on CFI and VID only in order to qualify for the optimization.
CFI is always set by OVS if a vlan is present in a packet, so there is
no need to match on PCP in this case.  'nw_frag' takes 2 bits of PCP
inside the simple match mark.

A new per-PMD flow table, 'simple_match_table', is introduced to store
simple match flows only.  'dp_netdev_flow_add' adds a flow to the usual
'flow_table' and to the 'simple_match_table' if the flow meets the
following constraints:

  - 'recirc_id' in flow match is 0.
  - 'packet_type' in flow match is Ethernet.
  - Flow wildcards contain only the minimal set of non-wildcarded
    fields (listed above).

If the number of flows for the current 'in_port' in the regular
'flow_table' equals the number of flows for the current 'in_port' in
the 'simple_match_table', we may use the simple match optimization,
because all the flows we have are simple match flows.  This means that
we only need to parse 'dl_type', 'vlan_tci' and 'nw_frag' to perform
packet matching.  We then build the unique flow mark from the
'in_port', 'dl_type', 'nw_frag' and 'vlan_tci' and look it up in the
'simple_match_table'.  On successful lookup we don't need to run the
full 'miniflow_extract()'.

Unsuccessful lookup technically means that we have no suitable flow in
the datapath and an upcall will be required.  So, in this case EMC and
SMC lookups are disabled.  We may optimize this path in the future by
bypassing the dpcls lookup too.

Performance improvement of this solution on 'simple match' flows should
be comparable with partial HW offloading, because it parses the same
packet fields and uses a similar flow lookup scheme.  However, unlike
partial HW offloading, it works for all port types including virtual
ones.

Performance results when compared to EMC:

Test setup:

             virtio-user   OVS    virtio-user
  Testpmd1  ------------>  pmd1  ------------>  Testpmd2
  (txonly)       x<------  pmd2  <------------ (mac swap)

Single stream of 64byte packets.  Actions:

  in_port=vhost0,actions=vhost1
  in_port=vhost1,actions=vhost0

Stats collected from pmd1 and pmd2, so there are 2 scenarios:

  Virt-to-Virt   : Testpmd1 ------> pmd1 ------> Testpmd2.
  Virt-to-NoCopy : Testpmd2 ------> pmd2 --->x   Testpmd1.

In the second scenario the packet sent from pmd2 to Testpmd1 is always
dropped, because the virtqueue is full since Testpmd1 is in txonly mode
and doesn't receive any packets.  This should be closer to the
performance of a VM-to-Phy scenario.

Test performed on a machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
The table below represents the improvement in throughput when compared
to EMC.

+----------------+------------------------+------------------------+
|                | Default (-g -O2)       | "-Ofast -march=native" |
| Scenario       +------------+-----------+------------+-----------+
|                |     GCC    |   Clang   |     GCC    |   Clang   |
+----------------+------------+-----------+------------+-----------+
| Virt-to-Virt   |    +18.9%  |   +25.5%  |    +10.8%  |   +16.7%  |
| Virt-to-NoCopy |    +24.3%  |   +33.7%  |    +14.9%  |   +22.0%  |
+----------------+------------+-----------+------------+-----------+

For the Phy-to-Phy case the performance improvement should be even
higher, but it's not the main use-case for this functionality.
Performance difference for the non-simple flows is within a margin of
error.

Acked-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: Ilya Maximets <[email protected]>
igsilya · Jan 7, 2022 · e7e9973
1 parent 46d44cf
Showing 15 changed files with 386 additions and 83 deletions.
diff --git a/Documentation/topics/dpdk/bridge.rst b/Documentation/topics/dpdk/bridge.rst
@@ -81,6 +81,30 @@ using the following command::
 
     $ ovs-vsctl get Interface <iface> statistics
 
+Simple Match Lookup
+-------------------
+
+There are cases where users might want simple forwarding or drop rules for all
+packets received from a specific port, e.g ::
+
+    in_port=1,actions=2
+    in_port=2,actions=IN_PORT
+    in_port=3,vlan_tci=0x1234/0x1fff,actions=drop
+    in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3
+
+There are also cases where complex OpenFlow rules can be simplified down to
+datapath flows with very simple match criteria.
+
+In theory, for very simple forwarding, OVS doesn't need to parse packets at all
+in order to follow these rules.  In practice, due to various implementation
+constraints, userspace datapath has to match at least on a small set of packet
+fields.  Some matching criteria (for example, ingress port) are not related to
+the packet itself and others (for example, VLAN tag or Ethernet type) can be
+extracted without fully parsing the packet.  This allows OVS to significantly
+speed up packet forwarding for these flows with simple match criteria.
+Statistics on the number of packets matched in this way can be found in a
+`simple match hits` counter of `ovs-appctl dpif-netdev/pmd-stats-show` command.
+
 EMC Insertion Probability
 -------------------------
 

diff --git a/NEWS b/NEWS
@@ -1,5 +1,8 @@
 Post-v2.16.0
 ---------------------
+   - Userspace datapath:
+     * Optimized flow lookups for datapath flows with simple match criteria.
+       See 'Simple Match Lookup' in Documentation/topics/dpdk/bridge.rst.
    - DPDK:
      * EAL argument --socket-mem is no longer configured by default upon
        start-up.  If dpdk-socket-mem and dpdk-alloc-mem are not specified,

diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c
@@ -198,7 +198,8 @@ dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
                 if (mfex_hit) {
                     pkt_meta[i].tcp_flags = miniflow_get_tcp_flags(&key->mf);
                 } else {
-                    pkt_meta[i].tcp_flags = parse_tcp_flags(packet);
+                    pkt_meta[i].tcp_flags = parse_tcp_flags(packet,
+                                                            NULL, NULL, NULL);
                 }
 
                 pkt_meta[i].bytes = dp_packet_size(packet);

diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
@@ -232,10 +232,10 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
     uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
 
     ds_put_format(str,
-            "  Iterations:        %12"PRIu64"  (%.2f us/it)\n"
-            "  - Used TSC cycles: %12"PRIu64"  (%5.1f %% of total cycles)\n"
-            "  - idle iterations: %12"PRIu64"  (%5.1f %% of used cycles)\n"
-            "  - busy iterations: %12"PRIu64"  (%5.1f %% of used cycles)\n",
+            "  Iterations:         %12"PRIu64"  (%.2f us/it)\n"
+            "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
+            "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
+            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
             tot_iter, tot_cycles * us_per_cycle / tot_iter,
             tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
             idle_iter,
@@ -244,23 +244,26 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
             100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
     if (rx_packets > 0) {
         ds_put_format(str,
-            "  Rx packets:        %12"PRIu64"  (%.0f Kpps, %.0f cycles/pkt)\n"
-            "  Datapath passes:   %12"PRIu64"  (%.2f passes/pkt)\n"
-            "  - PHWOL hits:      %12"PRIu64"  (%5.1f %%)\n"
-            "  - MFEX Opt hits:   %12"PRIu64"  (%5.1f %%)\n"
-            "  - EMC hits:        %12"PRIu64"  (%5.1f %%)\n"
-            "  - SMC hits:        %12"PRIu64"  (%5.1f %%)\n"
-            "  - Megaflow hits:   %12"PRIu64"  (%5.1f %%, %.2f "
-                                                "subtbl lookups/hit)\n"
-            "  - Upcalls:         %12"PRIu64"  (%5.1f %%, %.1f us/upcall)\n"
-            "  - Lost upcalls:    %12"PRIu64"  (%5.1f %%)\n",
+            "  Rx packets:         %12"PRIu64"  (%.0f Kpps, %.0f cycles/pkt)\n"
+            "  Datapath passes:    %12"PRIu64"  (%.2f passes/pkt)\n"
+            "  - PHWOL hits:       %12"PRIu64"  (%5.1f %%)\n"
+            "  - MFEX Opt hits:    %12"PRIu64"  (%5.1f %%)\n"
+            "  - Simple Match hits:%12"PRIu64"  (%5.1f %%)\n"
+            "  - EMC hits:         %12"PRIu64"  (%5.1f %%)\n"
+            "  - SMC hits:         %12"PRIu64"  (%5.1f %%)\n"
+            "  - Megaflow hits:    %12"PRIu64"  (%5.1f %%, %.2f "
+                                                 "subtbl lookups/hit)\n"
+            "  - Upcalls:          %12"PRIu64"  (%5.1f %%, %.1f us/upcall)\n"
+            "  - Lost upcalls:     %12"PRIu64"  (%5.1f %%)\n",
             rx_packets, (rx_packets / duration) / 1000,
             1.0 * stats[PMD_CYCLES_ITER_BUSY] / rx_packets,
             passes, rx_packets ? 1.0 * passes / rx_packets : 0,
             stats[PMD_STAT_PHWOL_HIT],
             100.0 * stats[PMD_STAT_PHWOL_HIT] / passes,
             stats[PMD_STAT_MFEX_OPT_HIT],
             100.0 * stats[PMD_STAT_MFEX_OPT_HIT] / passes,
+            stats[PMD_STAT_SIMPLE_HIT],
+            100.0 * stats[PMD_STAT_SIMPLE_HIT] / passes,
             stats[PMD_STAT_EXACT_HIT],
             100.0 * stats[PMD_STAT_EXACT_HIT] / passes,
             stats[PMD_STAT_SMC_HIT],
@@ -275,16 +278,18 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
             stats[PMD_STAT_LOST],
             100.0 * stats[PMD_STAT_LOST] / passes);
     } else {
-        ds_put_format(str, "  Rx packets:        %12d\n", 0);
+        ds_put_format(str,
+            "  Rx packets:         %12d\n", 0);
     }
     if (tx_packets > 0) {
         ds_put_format(str,
-            "  Tx packets:        %12"PRIu64"  (%.0f Kpps)\n"
-            "  Tx batches:        %12"PRIu64"  (%.2f pkts/batch)\n",
+            "  Tx packets:         %12"PRIu64"  (%.0f Kpps)\n"
+            "  Tx batches:         %12"PRIu64"  (%.2f pkts/batch)\n",
             tx_packets, (tx_packets / duration) / 1000,
             tx_batches, 1.0 * tx_packets / tx_batches);
     } else {
-        ds_put_format(str, "  Tx packets:        %12d\n\n", 0);
+        ds_put_format(str,
+            "  Tx packets:         %12d\n\n", 0);
     }
 }
 

diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
@@ -58,6 +58,7 @@ extern "C" {
 enum pmd_stat_type {
     PMD_STAT_PHWOL_HIT,     /* Packets that had a partial HWOL hit (phwol). */
     PMD_STAT_MFEX_OPT_HIT,  /* Packets that had miniflow optimized match. */
+    PMD_STAT_SIMPLE_HIT,    /* Packets that had a simple match hit. */
     PMD_STAT_EXACT_HIT,     /* Packets that had an exact match (emc). */
     PMD_STAT_SMC_HIT,       /* Packets that had a sig match hit (SMC). */
     PMD_STAT_MASKED_HIT,    /* Packets that matched in the flow table. */

diff --git a/lib/dpif-netdev-private-flow.h b/lib/dpif-netdev-private-flow.h
@@ -87,6 +87,8 @@ struct dp_netdev_flow {
     /* Hash table index by unmasked flow. */
     const struct cmap_node node; /* In owning dp_netdev_pmd_thread's */
                                  /* 'flow_table'. */
+    const struct cmap_node simple_match_node; /* In dp_netdev_pmd_thread's
+                                                 'simple_match_table'. */
     const struct cmap_node mark_node; /* In owning flow_mark's mark_to_flow */
     const ovs_u128 ufid;         /* Unique flow identifier. */
     const ovs_u128 mega_ufid;    /* Unique mega flow identifier. */
@@ -100,7 +102,8 @@ struct dp_netdev_flow {
     struct ovs_refcount ref_cnt;
 
     bool dead;
-    uint32_t mark;               /* Unique flow mark assigned to a flow */
+    uint32_t mark;               /* Unique flow mark for netdev offloading. */
+    uint64_t simple_match_mark;  /* Unique flow mark for the simple match. */
 
     /* Statistics. */
     struct dp_netdev_flow_stats stats;
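
The commit message describes 'simple_match_mark' as a 64-bit value built from 'in_port', 'dl_type', 'nw_frag' and the CFI and VID bits of 'vlan_tci', with 'nw_frag' occupying 2 of the PCP bit positions. A minimal sketch of that packing is shown here, assuming all inputs are in host byte order. The function name, the constants and the exact bit positions are hypothetical and chosen for illustration only; the layout used by the actual patch may differ (for example, to avoid byte order conversions).

```c
#include <stdint.h>

/* Hypothetical constants for this sketch (host byte order). */
#define SKETCH_VLAN_VID_MASK 0x0fffu  /* 12-bit VLAN ID. */
#define SKETCH_VLAN_CFI      0x1000u  /* CFI bit, set when a tag is present. */
#define SKETCH_FRAG_SHIFT    13       /* nw_frag reuses 2 of the 3 PCP bits. */

/* Packs the four fields into one 64-bit key:
 *   bits 63..32  in_port
 *   bits 31..16  dl_type
 *   bits 14..13  nw_frag (2 bits)
 *   bit  12      VLAN CFI
 *   bits 11..0   VLAN VID
 * PCP bits of 'vlan_tci' are masked out, so flows that need to match on
 * PCP cannot be represented by this key. */
static uint64_t
sketch_simple_match_mark(uint32_t in_port, uint16_t dl_type,
                         uint8_t nw_frag, uint16_t vlan_tci)
{
    uint16_t low = (uint16_t) ((vlan_tci
                                & (SKETCH_VLAN_VID_MASK | SKETCH_VLAN_CFI))
                               | ((nw_frag & 0x3u) << SKETCH_FRAG_SHIFT));

    return ((uint64_t) in_port << 32) | ((uint64_t) dl_type << 16) | low;
}
```

In this sketch two packets that differ only in PCP produce the same key, which is the property that lets the key serve as an exact-match hash map index for flows that do not match on PCP.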

diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-thread.h
@@ -26,6 +26,7 @@
 #include <stdbool.h>
 #include <stdint.h>
 
+#include "ccmap.h"
 #include "cmap.h"
 
 #include "dpif-netdev-private-dfc.h"
@@ -86,12 +87,18 @@ struct dp_netdev_pmd_thread {
 
     /* Flow-Table and classifiers
      *
-     * Writers of 'flow_table' must take the 'flow_mutex'.  Corresponding
-     * changes to 'classifiers' must be made while still holding the
-     * 'flow_mutex'.
+     * Writers of 'flow_table'/'simple_match_table' and their n* ccmap's must
+     * take the 'flow_mutex'.  Corresponding changes to 'classifiers' must be
+     * made while still holding the 'flow_mutex'.
      */
     struct ovs_mutex flow_mutex;
     struct cmap flow_table OVS_GUARDED; /* Flow table. */
+    struct cmap simple_match_table OVS_GUARDED; /* Flow table with simple
+                                                   match flows only. */
+    /* Number of flows in the 'flow_table' per in_port. */
+    struct ccmap n_flows OVS_GUARDED;
+    /* Number of flows in the 'simple_match_table' per in_port. */
+    struct ccmap n_simple_flows OVS_GUARDED;
 
     /* One classifier per in_port polled by the pmd */
     struct cmap classifiers;
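
The two per-port counters added above support the rule from the commit message: the simple match path may be used for an 'in_port' only when every flow installed for that port is also a simple match flow. A minimal sketch of that check follows, using plain arrays in place of the 'ccmap' counters. The struct, the function names and the fixed port bound are hypothetical and exist only for this illustration; locking is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_PORTS 8  /* Hypothetical bound, for the sketch only. */

struct sketch_pmd {
    uint32_t n_flows[SKETCH_MAX_PORTS];         /* All flows per in_port. */
    uint32_t n_simple_flows[SKETCH_MAX_PORTS];  /* Simple flows per in_port. */
};

/* Accounts for a newly added flow.  The caller must pass
 * in_port < SKETCH_MAX_PORTS. */
static void
sketch_flow_add(struct sketch_pmd *pmd, uint32_t in_port, bool is_simple)
{
    pmd->n_flows[in_port]++;
    if (is_simple) {
        pmd->n_simple_flows[in_port]++;
    }
}

/* The simple match lookup is usable for 'in_port' only if all flows
 * installed for that port are simple match flows. */
static bool
sketch_simple_match_enabled(const struct sketch_pmd *pmd, uint32_t in_port)
{
    return pmd->n_flows[in_port] == pmd->n_simple_flows[in_port];
}
```

Note that in this sketch a port with no flows at all also passes the check; a lookup then simply misses, which matches the statement that an unsuccessful lookup means no suitable flow exists and an upcall is required. A single non-simple flow on a port disables the optimization for that port only.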

diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man
@@ -11,10 +11,11 @@ Shows performance statistics for one or all pmd threads of the datapath
 \fIdp\fR. The special thread "main" sums up the statistics of every non pmd
 thread.
 
-The sum of "emc hits", "smc hits", "megaflow hits" and "miss" is the number of
-packet lookups performed by the datapath. Beware that a recirculated packet
-experiences one additional lookup per recirculation, so there may be
-more lookups than forwarded packets in the datapath.
+The sum of "phwol hits", "simple match hits", "emc hits", "smc hits",
+"megaflow hits" and "miss" is the number of packet lookups performed by the
+datapath. Beware that a recirculated packet experiences one additional lookup
+per recirculation, so there may be more lookups than forwarded packets in the
+datapath.
 
 The MFEX Opt hits displays the number of packets that are processed by the
 optimized miniflow extract implementations.
@@ -140,8 +141,9 @@ pmd thread numa_id 0 core_id 1:
   Datapath passes:        3599415  (1.50 passes/pkt)
   - PHWOL hits:                 0  (  0.0 %)
   - MFEX Opt hits:        3570133  ( 99.2 %)
+  - Simple Match hits:          0  (  0.0 %)
   - EMC hits:              336472  (  9.3 %)
-  - SMC hits:                   0  ( 0.0 %)
+  - SMC hits:                   0  (  0.0 %)
   - Megaflow hits:        3262943  ( 90.7 %, 1.00 subtbl lookups/hit)
   - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
   - Lost upcalls:               0  (  0.0 %)