Network struct reorganize #4
base: linux-6.6.y
Conversation
mainline inclusion from mainline-v6.8-rc1
category: performance

Analyzed a few structs in the networking stack by looking at variables within them that are used in the TCP/IP fast path. Fast path is defined as the TCP path where data is transferred from sender to receiver unidirectionally. It doesn't include phases other than TCP_ESTABLISHED, nor does it look at error paths. We hope to reorganize variables that span many cachelines whose fast path variables are also spread out, and this document can help future developers keep networking fast path cachelines small.

The Optimized_cacheline field is computed as (Fastpath_Bytes / L3_cacheline_size_x86), not the actual reorganized result (see the patches to come for those). Investigation was done on 6.5.

| Name | Struct_Cachelines | Cur_fastpath_cache | Fastpath_Bytes | Optimized_cacheline |
|---|---|---|---|---|
| tcp_sock | 42 (2664 bytes) | 12 | 396 | 8 |
| net_device | 39 (2240 bytes) | 12 | 234 | 4 |
| inet_sock | 15 (960 bytes) | 14 | 922 | 14 |
| inet_connection_sock | 22 (1368 bytes) | 18 | 1166 | 18 |
| netns_ipv4 (sysctls) | 12 (768 bytes) | 4 | 77 | 2 |
| linux_mib | 16 (1060) | 6 | 104 | 2 |

Note how there isn't much improvement space for inet_sock and inet_connection_sock, because sk and icsk_inet respectively take up so much of the struct that the remaining variables are a small portion of the struct size. So we decided to reorganize tcp_sock, net_device, and netns_ipv4.

Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Coco Li <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 14006f1)
Signed-off-by: Wentao Guan <[email protected]>
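As a side note for readers, the Optimized_cacheline estimate described above is just a round-up of the fast-path byte count to whole 64-byte cache lines. The sketch below is illustrative only (the helper name is made up, and real packed layouts can differ slightly due to member alignment, which is why a couple of table rows deviate from the plain formula):

```c
#include <stddef.h>

#define L3_CACHELINE_SIZE 64 /* x86 cache line size assumed in the analysis */

/* Hypothetical helper: minimum number of 64-byte cache lines needed if
 * all fast-path bytes were packed contiguously (round-up division). */
static unsigned int optimized_cachelines(unsigned int fastpath_bytes)
{
	return (fastpath_bytes + L3_CACHELINE_SIZE - 1) / L3_CACHELINE_SIZE;
}
```

For example, net_device's 234 fast-path bytes fit in 4 lines, matching the table row above.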
mainline inclusion from mainline-v6.8-rc1
category: performance

Set up build time warnings to safeguard against future header changes of organized structs. The warnings check:

1) whether all variables are still in the same cache group
2) whether all the cache groups have the expected total size of their members (in the maximum condition, including all members defined in configs)

The __cache_group* variables are ignored in the kernel-doc check in the various header files they appear in to enforce the cache groups.

Suggested-by: Daniel Borkmann <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Coco Li <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit aeb9ce0)
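To give a feel for how such build-time checks work, here is a minimal sketch of the idea — this is not the kernel's actual CACHELINE_ASSERT_GROUP_MEMBER() macro, and the struct, group, and macro names below are all illustrative. Group boundaries are marked with zero-size members (a GNU C extension, as in the kernel), and each member is asserted at compile time to lie between them:

```c
#include <stddef.h>

struct demo_sock {
	char __cache_group_read_tx_begin[0]; /* group start marker */
	unsigned int max_window;
	unsigned int notsent_lowat;
	unsigned short gso_segs;
	char __cache_group_read_tx_end[0];   /* group end marker */
	unsigned long bytes_sent;            /* outside the group */
};

/* Hypothetical stand-in for the kernel macro: fail the build if a member
 * is moved outside its declared cache group. */
#define GROUP_ASSERT_MEMBER(type, grp, member)				\
	_Static_assert(offsetof(type, member) >=			\
		       offsetof(type, __cache_group_##grp##_begin) &&	\
		       offsetof(type, member) <				\
		       offsetof(type, __cache_group_##grp##_end),	\
		       #member " moved out of group " #grp)

GROUP_ASSERT_MEMBER(struct demo_sock, read_tx, max_window);
GROUP_ASSERT_MEMBER(struct demo_sock, read_tx, gso_segs);
```

If a later header change drags gso_segs below the end marker, the build fails with a message naming the member and the group, which is exactly the safeguard this patch introduces.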
mainline inclusion from mainline-v6.8-rc1
category: performance

Reorganize fast path variables in tx-txrx-rx order. The fastpath cacheline ends after sysctl_tcp_rmem. There are only read-only variables here (write is on the control path and not considered in this case).

Below data generated with pahole on the x86 architecture:

Fast path variables span cache lines before change: 4
Fast path variables span cache lines after change: 2

Suggested-by: Eric Dumazet <[email protected]>
Reviewed-by: Wei Wang <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: Coco Li <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 18fd64d)
mainline inclusion from mainline-v6.8-rc1
category: performance

Reorganize fast path variables in tx-txrx-rx order. Fastpath variables end after npinfo.

Below data generated with pahole on the x86 architecture:

Fast path variables span cache lines before change: 12
Fast path variables span cache lines after change: 4

Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Coco Li <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 43a71cd)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion from mainline-v6.8-rc1
category: performance

The variables are organized in the following way:

- TX read-mostly hotpath cache lines
- TXRX read-mostly hotpath cache lines
- RX read-mostly hotpath cache lines
- TX read-write hotpath cache line
- TXRX read-write hotpath cache line
- RX read-write hotpath cache line

Fastpath cachelines end after rcvq_space.

Cache line boundaries are enforced only between read-mostly and read-write. That is, if read-mostly tx cachelines bleed into read-mostly txrx cachelines, we do not care. We care about the boundaries between read and write cachelines, because we want to prevent false sharing.

Fast path variables span cache lines before change: 12
Fast path variables span cache lines after change: 8

Suggested-by: Eric Dumazet <[email protected]>
Reviewed-by: Wei Wang <[email protected]>
Signed-off-by: Coco Li <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit d5fed5a)
Signed-off-by: Wentao Guan <[email protected]>
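The read-mostly/read-write boundary described above can be sketched in a few lines of C. This is a minimal illustration assuming 64-byte cache lines on x86; the field names are stand-ins, not the real tcp_sock layout:

```c
#include <stddef.h>

struct split_sock {
	/* read-mostly fast-path fields: the cache line holding these can
	 * stay in the Shared state across CPUs */
	unsigned int mss_cache;
	unsigned int snd_wnd;

	/* read-write fields forced onto their own cache line, so stores
	 * to them do not invalidate the read-mostly line on other CPUs
	 * (i.e. no false sharing across the boundary) */
	unsigned int snd_nxt __attribute__((aligned(64)));
	unsigned int rcv_nxt;
};
```

With this layout, snd_nxt starts exactly at a cache line boundary, so writers touching snd_nxt/rcv_nxt never dirty the line that readers of mss_cache/snd_wnd depend on.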
mainline inclusion from mainline-v6.8-rc1
category: performance

dev->gso_partial_features is read from the tx fast path for GSO packets. Move it to the appropriate section to avoid a cache line miss.

Fixes: 43a71cd ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Cc: David Ahern <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 993498e)
mainline inclusion from mainline-v6.8-rc1
category: performance

xdp_prog is used in the receive path, both by XDP enabled drivers and by netif_elide_gro(). This patch also removes two 4-byte holes.

Fixes: 43a71cd ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Cc: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit d3d344a)
mainline inclusion from mainline-v6.8-rc5
category: performance

tp->scaling_ratio is a read-mostly field, used in the rx and tx fast paths.

Fixes: d5fed5a ("tcp: reorganize tcp_sock fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Cc: Wei Wang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 119ff04)
mainline inclusion from mainline-v6.8-rc5
category: performance

[gwt bp tip]: its field is reserved for future

tp->tcp_usec_ts is a read-mostly field, used in the rx and tx fast paths.

Fixes: d5fed5a ("tcp: reorganize tcp_sock fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Cc: Wei Wang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 666a877)
mainline inclusion from mainline-v6.8-rc5
category: performance

dev->lstats is notably used from the loopback ndo_start_xmit() and other virtual drivers. Per-cpu stats updates dirty per-cpu data, but the pointer itself is read-only.

Fixes: 43a71cd ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Cc: Simon Horman <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit c353c7b)
mainline inclusion from mainline-v6.9-rc1
category: performance

dev->state can be read in the rx and tx fast paths. netif_running(), which needs dev->state, is called from:

- enqueue_to_backlog() [RX path]
- __dev_direct_xmit() [TX path]

Fixes: 43a71cd ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
(cherry picked from commit f6e0a49)
mainline inclusion from mainline-v6.13-rc1
category: performance

Commit d5fed5a ("tcp: reorganize tcp_sock fast path variables") moved the fields around and misplaced the documentation for "lsndtime". So, let's move it back to the proper place.

Signed-off-by: Menglong Dong <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 0a2cdeeae9ddc14d752173be6af98bc9fb45c6ad)
Reviewer's Guide by Sourcery

This pull request reorganizes several networking structures to optimize cacheline usage by grouping related fields together. The changes include splitting the tcp_sock, net_device, and netns_ipv4 structures into cacheline groups marked by new macros, adding compile-time assertions to check proper cacheline alignment and sizes, and updating associated documentation and helper macros.

Updated class diagram for tcp_sock with cacheline groups

classDiagram
class tcp_sock {
<<cacheline group: tcp_sock_read_tx>>
+ max_window: u32
+ rcv_ssthresh: u32
+ reordering: u32
+ notsent_lowat: u32
+ gso_segs: u16
+ lost_skb_hint: sk_buff*
+ retransmit_skb_hint: sk_buff*
<<cacheline group: tcp_sock_read_txrx>>
+ tsoffset: u32
+ snd_wnd: u32
+ mss_cache: u32
+ snd_cwnd: u32
+ prr_out: u32
+ lost_out: u32
+ sacked_out: u32
+ tcp_header_len: u16
<<cacheline group: tcp_sock_read_rx>>
+ copied_seq: u32
+ rcv_tstamp: u32
+ snd_wl1: u32
+ tlp_high_seq: u32
+ rttvar_us: u32
<<cacheline group: tcp_sock_write_tx>>
+ segs_out: u32
+ data_segs_out: u32
+ bytes_sent: u64
<<cacheline group: tcp_sock_write_txrx>>
+ pred_flags: __be32
+ rcv_nxt: u32
+ snd_nxt: u32
+ snd_una: u32
+ window_clamp: u32
+ srtt_us: u32
<<cacheline group: tcp_sock_write_rx>>
+ bytes_received: u64
+ segs_in: u32
+ data_segs_in: u32
+ rcv_wup: u32
}
Updated class diagram for net_device with cacheline groups

classDiagram
class net_device {
<<cacheline group: net_device_read_tx>>
+ priv_flags: unsigned long long
+ netdev_ops: const struct net_device_ops*
+ header_ops: const struct header_ops*
+ _tx: netdev_queue*
+ mtu: unsigned int
<<cacheline group: net_device_read_txrx>>
+ lstats/tstats/dstats: pointer
+ state: unsigned long
+ flags: unsigned int
+ hard_header_len: unsigned short
+ features: netdev_features_t
+ ip6_ptr: inet6_dev*
<<cacheline group: net_device_read_rx>>
+ xdp_prog: bpf_prog*
+ ptype_specific: list_head
+ ifindex: int
+ real_num_rx_queues: unsigned int
+ _rx: netdev_rx_queue*
}
Updated class diagram for netns_ipv4 with cacheline groups

classDiagram
class netns_ipv4 {
<<cacheline group: netns_ipv4_read_tx>>
+ sysctl_tcp_early_retrans: u8
+ sysctl_tcp_tso_win_divisor: u8
+ sysctl_tcp_tso_rtt_log: u8
+ sysctl_tcp_autocorking: u8
+ sysctl_tcp_min_snd_mss: int
+ sysctl_tcp_notsent_lowat: unsigned int
<<cacheline group: netns_ipv4_read_txrx>>
+ sysctl_tcp_moderate_rcvbuf: u8
<<cacheline group: netns_ipv4_read_rx>>
+ sysctl_ip_early_demux: u8
+ sysctl_tcp_early_demux: u8
+ sysctl_tcp_reordering: int
+ sysctl_tcp_rmem: int[3]
}
Hey @opsiff - I've reviewed your changes - here's some feedback:
Overall Comments:
- It would be good to add a script to automatically check the cacheline layout.
- Consider using __cacheline_aligned instead of ____cacheline_aligned for better readability.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 2 issues found
- 🟢 Documentation: all looks good
@@ -175,23 +175,122 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
#define TCP_RMEM_TO_WIN_SCALE 8

struct tcp_sock { |
issue (complexity): Consider using dedicated sub-structures for cacheline groups to improve readability and reduce repetitive markers.
Consider grouping related fields into separate sub-structures rather than using many inline __cacheline_group markers. For example, instead of:
__cacheline_group_begin(tcp_sock_read_tx);
u32 max_window;
u32 rcv_ssthresh;
u32 reordering;
u32 notsent_lowat;
u16 gso_segs;
struct sk_buff *lost_skb_hint;
struct sk_buff *retransmit_skb_hint;
__cacheline_group_end(tcp_sock_read_tx);
you could define a dedicated sub-structure and include it in the parent, like:
struct tcp_sock_read_tx {
u32 max_window; /* Maximal window ever seen from peer */
u32 rcv_ssthresh; /* Current window clamp */
u32 reordering; /* Packet reordering metric */
u32 notsent_lowat; /* TCP_NOTSENT_LOWAT */
u16 gso_segs; /* GSO segments */
struct sk_buff *lost_skb_hint;
struct sk_buff *retransmit_skb_hint;
};
struct tcp_sock {
struct inet_connection_sock inet_conn;
__cacheline_group_begin(tcp_sock_read_tx);
struct tcp_sock_read_tx read_tx;
__cacheline_group_end(tcp_sock_read_tx);
/* other cacheline groups... */
};
This modularizes the layout, makes the structure easier to read, and reduces repetitive markers. Similar refactoring can be applied to other cacheline groups.
@@ -11578,6 +11578,63 @@ static struct pernet_operations __net_initdata default_device_ops = {
	.exit_batch = default_device_exit_batch,
};

static void __init net_dev_struct_check(void) |
issue (complexity): Consider abstracting the repetitive CACHELINE_ASSERT_GROUP_MEMBER calls into a helper macro or function.
Consider abstracting these repetitive assertions into a helper macro so that you list the fields just once. For example, you could define a macro that maps a list of member names into individual assertions. One approach is:
/* Helper macro to call the assertion on a field */
#define CACHELINE_ASSERT_FIELD(struct_type, group, field) \
CACHELINE_ASSERT_GROUP_MEMBER(struct_type, group, field)
/* Helper to apply the above macro on each field in a simple list;
* since C99 doesn’t have a standard MAP macro, you can unroll a small one:
*/
#define CACHELINE_ASSERT_FIELDS_TX(...) \
do { \
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, __VA_ARGS__); \
} while(0)
If the number of fields is fixed, you might “unroll” them:
static void __init net_dev_struct_check(void)
{
/* TX read-mostly hotpath */
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, priv_flags);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, netdev_ops);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, header_ops);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, _tx);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, real_num_tx_queues);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, gso_max_size);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, gso_ipv4_max_size);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, gso_max_segs);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, gso_partial_features);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, num_tc);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, mtu);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, needed_headroom);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, tc_to_txq);
#ifdef CONFIG_XPS
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, xps_maps);
#endif
#ifdef CONFIG_NETFILTER_EGRESS
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, nf_hooks_egress);
#endif
#ifdef CONFIG_NET_XGRESS
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_tx, tcx_egress);
#endif
CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_tx, 160);
/* TXRX read-mostly hotpath */
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_txrx, lstats);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_txrx, state);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_txrx, flags);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_txrx, hard_header_len);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_txrx, features);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_txrx, ip6_ptr);
CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_txrx, 46);
/* RX read-mostly hotpath */
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, ptype_specific);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, ifindex);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, real_num_rx_queues);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, _rx);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, gro_flush_timeout);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, napi_defer_hard_irqs);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, gro_max_size);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, gro_ipv4_max_size);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, rx_handler);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, rx_handler_data);
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, nd_net);
#ifdef CONFIG_NETPOLL
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, npinfo);
#endif
#ifdef CONFIG_NET_XGRESS
CACHELINE_ASSERT_FIELD(struct net_device, net_device_read_rx, tcx_ingress);
#endif
CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 104);
}
If you wish to further reduce the repetition, consider building a small configuration table for each group and iterating over it. For example, you might create an array of field names or offsets (if the macros can work with computed offsets) and then loop through the array, invoking the assertion for each entry.
These changes encapsulate the repetitive logic. They preserve functionality while reducing the file’s complexity and making future modifications easier.
[ Upstream commit c7b87ce0dd10b64b68a0b22cb83bbd556e28fe81 ]

libtraceevent parses and returns an array of argument fields, sometimes larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr". idx will traverse to index 6 (the 7th element) whereas sc->fmt->arg holds 6 elements max, creating an out-of-bounds access. This runtime error is found by UBSan. The error message:

$ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1
builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]'
 #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966
 #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110
 #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436
 #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897
 #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335
 #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502
 #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351
 #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404
 #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448
 #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556
 #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
 #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360
 #12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6)

0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1) = 1

Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint")
Signed-off-by: Howard Chu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
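The shape of the fix above is a simple bounds guard before indexing the fixed-size table. This sketch is illustrative (the function name and loop body are stand-ins, not the actual builtin-trace.c code):

```c
#define RAW_SYSCALL_ARGS_NUM 6 /* size of the fixed arg-format table */

/* Hypothetical helper: libtraceevent may report more fields than the
 * table holds (e.g. an extra "__syscall_nr"), so clamp the index before
 * writing, which is what the fix does. */
static int fill_arg_fmts(int nr_fields)
{
	int idx, filled = 0;

	for (idx = 0; idx < nr_fields; idx++) {
		if (idx >= RAW_SYSCALL_ARGS_NUM)
			break;	/* the guard that stops the OOB access */
		filled++;	/* would initialize the arg format slot here */
	}
	return filled;
}
```

With 7 reported fields, only the first 6 slots are filled instead of writing one element past the array.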
commit c6ef3a7fa97ec823a1e1af9085cf13db9f7b3bac upstream.

If the uvc_status_init() function fails to allocate the int_urb, it will free the dev->status pointer but doesn't reset the pointer to NULL. This results in the kfree() call in uvc_status_cleanup() trying to double-free the memory. Fix it by resetting the dev->status pointer to NULL after freeing it.

Fixes: a31a405 ("V4L/DVB:usbvideo:don't use part of buffer for USB transfer #4")
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Laurent Pinchart <[email protected]>
Reviewed-by: Ricardo Ribalda <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
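The pattern behind this fix is worth spelling out: after freeing on an error path, reset the pointer so a later cleanup's free is a harmless no-op rather than a double-free. A minimal userspace sketch, with free() standing in for kfree() and made-up struct/function names:

```c
#include <stdlib.h>

struct demo_dev {
	void *status;
};

/* Error path during init: free the buffer AND reset the pointer.
 * Without the reset, the cleanup below would free it a second time. */
static void demo_status_init_error(struct demo_dev *dev)
{
	free(dev->status);
	dev->status = NULL;	/* the fix: prevents the double-free */
}

/* Unconditional cleanup: safe because free(NULL) is defined to do
 * nothing, just like kfree(NULL) in the kernel. */
static void demo_status_cleanup(struct demo_dev *dev)
{
	free(dev->status);
	dev->status = NULL;
}
```

Calling demo_status_init_error() followed by demo_status_cleanup() is now safe in either order or combination.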
Summary by Sourcery

Reorganize network structures for better cacheline alignment.

New Features:
- tcp_sock, net_device, and netns_ipv4 structures.

Enhancements:
- tcp_sock structure to group frequently accessed fields in the same cacheline, improving performance.

Tests: