Notes for Homa implementation in Linux:
---------------------------------------
* Thoughts on making TCP and Homa play better together:
* Goals:
* Keep the NIC tx queue from growing long.
* Share bandwidth fairly between TCP and Homa
* If one protocol is using a lot less bandwidth, give it preference
for transmission?
* Balance queue lengths for the protocols?
* Approach #1: analogous to "fair share" CPU scheduling
* Keep separate tx queues for Homa and TCP
* Try to equalize the queue lengths while pacing packets at
network speed?
* Keep track of lengths for each of many queues
* Each queue paces itself based on relative lengths
* Do all pacing centrally, call back to individual queues for output?
* Approach #2:
* Keep track of recent bandwidth consumed by each protocol; when there
is overload, restrict each protocol to its fraction of recent bandwidth.
* "Consumed" has to be measured in terms of bytes offered, not bytes
actually transmitted (otherwise a protocol could get "stuck" at a low
transmission rate?).
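* A sketch of the bookkeeping Approach #2 needs (hypothetical, not
  current code): a decay-weighted average of bytes offered per protocol,
  updated once per measurement interval:

      #include <linux/types.h>

      /* Exponentially weighted moving average of bytes offered per interval. */
      struct offered_ewma {
              u64 avg_bytes;
      };

      static void ewma_update(struct offered_ewma *e, u64 offered_this_interval)
      {
              /* new avg = 7/8 * old + 1/8 * new; shifts avoid division. */
              e->avg_bytes -= e->avg_bytes >> 3;
              e->avg_bytes += offered_this_interval >> 3;
      }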
* Approach #3: token bucket
* Use a token bucket for each protocol with 50% of available bandwidth
(or maybe less?). Split any extra available bandwidth among the
protocols. Maybe adjust rates for the token buckets based on recent
traffic?
* Also consider the amount of data that is "stuck" in the NIC?
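* A minimal sketch of the per-protocol token bucket in Approach #3
  (hypothetical names, not current Homa code; uses the standard kernel
  helpers ktime_get_ns and div_u64):

      #include <linux/ktime.h>
      #include <linux/math64.h>
      #include <linux/minmax.h>
      #include <linux/types.h>

      struct proto_bucket {
              u64 rate_bps;      /* Bytes per second allotted to this protocol. */
              u64 max_tokens;    /* Cap on accumulated credit, in bytes. */
              u64 tokens;        /* Current credit, in bytes. */
              u64 last_refill;   /* ktime_get_ns() at last refill. */
      };

      /* Refill credit for elapsed time, then decide if a packet may be sent. */
      static bool bucket_may_xmit(struct proto_bucket *b, u64 pkt_bytes)
      {
              u64 now = ktime_get_ns();
              u64 earned = div_u64((now - b->last_refill) * b->rate_bps,
                                   NSEC_PER_SEC);

              b->tokens = min(b->max_tokens, b->tokens + earned);
              b->last_refill = now;
              if (b->tokens < pkt_bytes)
                      return false;
              b->tokens -= pkt_bytes;
              return true;
      }

  Each protocol's tx queue would own one bucket; extra bandwidth could be
  split by periodically raising rate_bps for whichever bucket stays full.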
* Notes on Linux qdiscs:
* Remedies to consider for the performance problems at 100 Gbps, where
one tx channel gets very backed up:
* Implement zero-copy on output in order to reduce memory bandwidth
consumption (presumably this will increase throughput?)
* Reserve one channel for the pacer, and don't send non-paced packets
on that channel; this should eliminate the latency problems caused
by short messages getting queued on that channel
* Rework cp_node so that there aren't separate senders and receivers on the
client. Instead, have each client thread send, then conditionally receive,
then send again, etc. Hmmm, I believe there is a reason why this won't
work, but I have forgotten what it is.
* (July 2024) Found throughput problem in 2-node "--workload 50000 --one-way"
benchmark. The first packet for message N doesn't get sent until message
N-1 has been completely transmitted. This results in a gap between the
completion of message N and the arrival of a grant for message N+1, which
wastes throughput. Perhaps send an empty data packet for message N+1 so that
the grant can return earlier? Or, make a bigger change:
* Either a message is entirely scheduled or entirely unscheduled.
* For a scheduled message, send a 0-length data packet immediately,
before starting to copy in from user space.
* The round-trip for a grant can now happen in parallel with copying
bytes in from user space (it takes 16 usec to copy in 60 KB at 30 Gbps).
* The grant will be sent more quickly by the server because it doesn't
have to process a batch of data packets first
* homa_grant_recalc is being invoked a *lot* and it takes a lot of time each
call (13 usec/call, duty cycle > 100%):
* Does it need to relock every ranked RPC every call?
* About 25% of calls are failed attempts to promote an unranked RPC
(perhaps because the slots for its host are all taken?)
* Could it abort early if there is no incoming headroom?
* Notes on refactoring of grant mechanism:
* Need to reimplement FIFO grants
* Replace fifo_grant_increment with fifo_grant_interval
* Refactor so that the msgin structure is always properly initialized?
* The current implementation can execute RPCs multiple times on the server:
* The initial RPC arrives, but takes the slow path to SoftIRQ, which can
take many ms.
* Meantime the client retries, and the retry succeeds: the request is
processed, the response is returned to the client, and the RPC is
freed.
* Eventually SoftIRQ wakes up to handle the original packet, which re-creates
the RPC and it gets serviced a second time.
* SoftIRQ processing can lock out kernel-to-user copies; add a preemption
mechanism where the copying code can set a flag that it needs the lock,
then SoftIRQ releases the lock until the flag is clear?
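* A sketch of that flag-based preemption (hypothetical names, not current
  code): the copy path raises a flag before waiting, and SoftIRQ checks
  it every few packets and briefly drops the lock:

      #include <linux/atomic.h>
      #include <linux/spinlock.h>

      struct guarded_lock {
              spinlock_t lock;
              atomic_t copy_needs_lock;  /* Set while a copy to user space waits. */
      };

      /* Called from the kernel-to-user copy path (process context). */
      static void copy_acquire(struct guarded_lock *gl)
      {
              atomic_set(&gl->copy_needs_lock, 1);
              spin_lock_bh(&gl->lock);
              atomic_set(&gl->copy_needs_lock, 0);
      }

      /* Called from SoftIRQ every few packets while it holds the lock. */
      static void softirq_maybe_yield(struct guarded_lock *gl)
      {
              if (atomic_read(&gl->copy_needs_lock)) {
                      spin_unlock(&gl->lock);
                      /* Queued spinlocks are fair: the waiting copy gets in. */
                      spin_lock(&gl->lock);
              }
      }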
* For W3, throttle_min_bytes is a problem: a significant fraction of all
transmitted bytes aren't being counted; as a result, the NIC queue
can build up. Reducing throttle_min_bytes from 1000 to 200 reduced P99
short message latency from 250 us to 200 us.
* Don't understand why W3 performance is so variable under Gen3. Also,
it's worth comparing tthoma output for W4 and W3 under Gen2; e.g.,
W3 has way more active outgoing messages than W4.
* When requesting ACKs, must eventually give up and delete RPC.
* Eliminate msgin.scheduled: just compute when needed?
* Replace msgin.bytes_remaining with bytes_received?
* Include the GSO packet size in the initial packet, so the receiver
can predict how much grants will get rounded up.
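* For instance (hypothetical helper), the receiver could round each grant
  offset up to a multiple of the advertised GSO packet size:

      #include <linux/math.h>
      #include <linux/types.h>

      static inline u32 round_grant_to_gso(u32 offset, u32 gso_size)
      {
              return DIV_ROUND_UP(offset, gso_size) * gso_size;
      }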
* Out-of-order incoming grants seem to be pretty common.
* Round up bpage fragments to 64 byte boundaries
* CloudLab cluster issues:
* amd272 had a problem where all xmits from core 47 incurred a 1-2 ms
delay; power-cycling it fixed the problem.
* Notes on performance data from buffer benchmarking 8/2023:
* Restricting buffer space ("buffers"):
* Homa W5: unrestricted usage 3.8 MB/20 nodes
Performance degrades significantly below 80% of this (3 MB/20 nodes)
Homa still has better performance than DCTCP down to 1.9 MB
* DCTCP W5: unrestricted usage 5.8 MB/20 nodes
Performance degrades significantly below 25% of this (1.4 MB/20 nodes)
* TCP W5: unrestricted usage 13.1 MB/20 nodes (all available space)
Performance degrades significantly below 80% of this (10.5 MB/20 nodes)
* Varying the number of nodes ("nodes"):
* Homa usage seems to increase gradually with number of nodes
* TCP/DCTCP usage decreases with number of nodes
* Varying the link utilization ("link_util"):
* Homa: 50% utilization drops min. buffer usage by ~4x relative to 80%
Large messages suffer more at higher utilization
* DCTCP benefits less from lower link utilization: min buffer usage
drops by 35% at 50% utilization vs. 80% utilization
All message sizes benefit as utilization drops
* At 50% utilization, Homa's min buffer usage is less than DCTCP's! (recheck)
* TCP benefits more than DCTCP: 3x reduction in min buffers at 50% vs. 80%
* Varying the DCTCP marking threshold:
* Min buffer usage increases monotonically with marking threshold
* Slowdown is relatively stable (a bit higher at largest and smallest
thresholds).
* For small messages, P50 is better with low threshold, but P99 is higher
* Dynamic windows for Homa:
* Min buffer usage increases monotonically with larger windows
* Slowdown drops, then increases as window size increases
* Unclear whether there is a point where dwin has a better combination
of performance and buffer usage; maybe in the range of 3-4x rttbytes?
* Varying Homa's overcommitment:
* Results don't seem consistent with other benchmarks
* For W5, dropping overcommitment to 4 increases slowdown by 3x, drops
min buffer usage by 2x
* Even an overcommit of 7 increases slowdown by 2x?
* Varying unsched_bytes:
* For W4 and W5, increasing unsched_bytes from 40K to 120K increases
min buffer usage by 2x
* Issues for next round of full-rack testing:
* Verify that unrestricted slowdown curves for the buffer benchmarks match
the "standard" curves.
* Some metrics seem to vary a lot:
* Unrestricted buffer usage in "nodes" vs. "buffers" vs. "link_util"
* 10% drop in Homa performance: "nodes" vs. "buffers" vs. "link_util"
* DCTCP buffer usage in "nodes" first drops, then starts to rise;
recheck?
* Need more nodes to see if buffer usage/node eventually plateaus (need to
run all tests at 40 nodes)
* Explore dynamic windows more thoroughly
* Overcommit results look fishy
* Lower the threshold for assuming 0 min buffer usage (perhaps base on us
rather than MB?)
* Create a "scan_metrics" method to scan .metrics files and look for
outliers in multiple different ways?
* Compute a "P99 average" by computing average of worst 1% slowdowns
in 10 buckets (100?) according to message length?
* unsched for W3 has huge variation in RTTs; 40k is particularly bad. It
isn't keeping up on BW either; try raising client-max? Also, there seem
to be too many metrics files (the problem is with active_nodes in cperf.py?)
* recvmsg doesn't seem to return an address if there is an error? May
need to return the address in a different place?
* IPv6 issues:
* See if error checking made syscalls slower.
* GSO always uses SKB_GSO_TCPV6; sometimes it should be V4.
* Refactor of granting mechanism:
* Eliminate grant_increment: change to fifo_grant_increment instead
* grant_non_fifo may need to grant to a message that is also receiving
regular grants
* What if a message receives data beyond incoming, which completes the
message?
* Pinning memory: see mm.h and mm/gup.c
* get_user_page
* get_user_pages
* pin_user_page (not sure how this differs from get_user_page)
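* A hedged sketch of pinning a user buffer with these interfaces (exact
  flags and error handling depend on kernel version and use case):

      #include <linux/mm.h>

      static long pin_user_buffer(unsigned long uaddr, unsigned long npages,
                                  struct page **pages)
      {
              /* FOLL_WRITE: the kernel will write into these pages. */
              long pinned = pin_user_pages_fast(uaddr, npages, FOLL_WRITE,
                                                pages);

              if (pinned < 0)
                      return pinned;                   /* Hard error. */
              if (pinned < npages) {
                      /* Partial pin: back out and report failure. */
                      unpin_user_pages(pages, pinned);
                      return -EFAULT;
              }
              return pinned;
      }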
* Performance-related tasks:
* Rework FIFO granting so that it doesn't consider homa->max_overcommit
(just find the oldest message that doesn't have a pity grant)? Also,
it doesn't look like homa_grant_fifo is keeping track of pity grants
precisely; perhaps add another RPC field for this?
* Re-implement the duty-cycle mechanism. Use a generalized pacer to
control grants:
* Parameters:
* Allowable throughput
* Max accumulation of credits
* Methods:
* Request (current time, amount) (possibly 2 stages: isItOk and doIt?)
* Or, just reduce the link speed and let the pacer handle this?
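* A rough sketch of such a pacer (hypothetical names): credit accrues at
  the allowable throughput up to a cap, and a request either proceeds or
  learns how long to wait (an isItOk/doIt split would just separate the
  check from the debit):

      #include <linux/ktime.h>
      #include <linux/math64.h>
      #include <linux/minmax.h>

      struct gen_pacer {
              u64 bytes_per_sec;  /* Allowable throughput. */
              s64 max_credit;     /* Max accumulation of credit, in bytes. */
              s64 credit;         /* Current credit; goes negative after a grant. */
              u64 last_update;    /* ktime_get_ns() at last update. */
      };

      /* Returns 0 if "amount" bytes may go out now, else nanoseconds to wait. */
      static u64 pacer_request(struct gen_pacer *p, u64 now, u64 amount)
      {
              s64 earned = div_u64((now - p->last_update) * p->bytes_per_sec,
                                   NSEC_PER_SEC);

              p->credit = min_t(s64, p->max_credit, p->credit + earned);
              p->last_update = now;
              if (p->credit >= 0) {
                      p->credit -= amount;
                      return 0;
              }
              return div64_u64((u64)(-p->credit) * NSEC_PER_SEC,
                               p->bytes_per_sec);
      }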
* Perhaps limit the number of polling threads per socket, to solve
the problems with having lots of receiver threads?
* Move some reaping to the pacer? It has time to spare
* Figure out why TCP W2 P99 gets worse with higher --client-max
* See if turning off c-states allows shorter polling intervals?
* Modify cp_node's TCP to use multiple connections per client-server pair
* Why is TCP beating Homa on cp_server_ports? Perhaps TCP servers are getting
>1 request per kernel call?
* Things to do:
* If a socket is closed with unacked RPCs, the peer will have to send a
long series of NEED_ACKS (which must be ignored because the socket is
gone) before finally reaping the RPCs. Perhaps have a "no such socket"
packet type?
* Reap most of a message before getting an ack? To do this, sender
must include "received offset" in grants; then sender can free
everything up to the latest received offset.
* Try more aggressive retries (e.g. if a missing packet is sufficiently
long ago, don't wait for timeout).
* Eliminate hot spots involving NAPI:
* Arrange for incoming bursts to be divided into batches where
alternate batches do their NAPI on 2 different cores.
* To do this, use TCP for Homa!
* Send Homa packets using TCP, and use different ports to force
different NAPI cores
* Interpose on the TCP packet reception hooks, and redirect
real TCP packets back to TCP.
* Consider replacing grantable list with a heap?
* Unimplemented interface functions.
* Learn about CONFIG_COMPAT and whether it needs to be supported in
struct proto and struct proto_ops.
* Learn about security stuff, and functions that need to be called for this.
* Learn about memory management for sk_buffs: how many is it OK to have?
* See tcp_out_of_memory.
* Eventually initialize homa.next_client_port to something random
* Define a standard mechanism for returning errors:
* Socket not supported on server (or server process ends while
processing request).
* Server timeout
* Is it safe to use non-locking skb queue functions?
* Is the RCU usage for sockets safe? In particular, how long is it safe
to use a homa_sock returned by homa_find_socket? Could it be deleted from
underneath us? This question may no longer be relevant, given the
implementation of homa_find_socket.
* Can a packet input handler be invoked multiple times concurrently?
* What is audit_sockaddr? Do I need to invoke it when I read sockaddrs
from user space?
* When a struct homa is destroyed, all of its sockets end up in an unsafe
state in terms of their socktab links.
* Clean up ports and ips in unit_homa_incoming.c
* Plug into Linux capability mechanism (man(7) capabilities)
* Don't return any errors on sends?
* Homa-RAMCloud doesn't retransmit bytes if it transmitted other bytes
recently; should HomaModule do the same? Otherwise, will retransmit
for requests whose service time is just about equal to the resend timer.
* Check tcp_transmit_skb to make sure we are doing everything we need to
do with skbuffs (e.g., update sk_wmem_alloc?)
* Add support for cgroups (e.g. to manage memory allocation)
* Questions for Linux experts:
* If an interrupt arrives after a thread has been woken up to receive an
incoming message, but before the kernel call returns, is it possible
for the kernel call to return EINTR, such that the message isn't received
and no one else has woken up to handle it?
* OK to call kmalloc at interrupt level?
Yes, but must specify GFP_ATOMIC as argument, not GFP_KERNEL; the operation
will not sleep, which means it could fail more easily.
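For example (illustrative only):

      #include <linux/slab.h>

      /* Allocation in softirq/interrupt context: must not sleep, may fail. */
      static void *alloc_in_atomic_context(size_t size)
      {
              return kmalloc(size, GFP_ATOMIC);
      }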
* Is it OK to retain struct dst_entry pointers for a long time? Can they
ever become obsolete (e.g. because routes change)? It looks like the
"obsolete" field will take care of this. However, a socket is used to
create a dst entry; what if that socket goes away?
* Can flows and struct_dst's be shared across sockets? What information
must be considered to make these things truly safe for sharing (e.g.
source network port?)?
* Source addresses for things like creating flows: can't just use a single
value for this host? Could be different values at different times?
* How to lock between user-level and bottom-half code?
* Must use a spin lock
* Must invoke spin_lock_bh and spin_unlock_bh, which disable bottom
halves (softirqs) as well as acquire the lock.
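* Example usage pattern (illustrative names):

      #include <linux/spinlock.h>

      static DEFINE_SPINLOCK(shared_lock);
      static int shared_count;    /* Touched by syscall and softirq paths. */

      /* Process (syscall) context: disable bottom halves while holding lock. */
      static void touch_from_process(void)
      {
              spin_lock_bh(&shared_lock);
              shared_count++;
              spin_unlock_bh(&shared_lock);
      }

      /* Softirq context: plain spin_lock suffices here. */
      static void touch_from_softirq(void)
      {
              spin_lock(&shared_lock);
              shared_count++;
              spin_unlock(&shared_lock);
      }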
* What's the difference between bh_lock_sock and bh_lock_sock_nested?
* Is there a platform-independent way to read a high-frequency clock?
* get_cycles appears to perform a RDTSC
* cpu_khz holds the clock frequency
* do_gettimeofday takes 750 cycles!
* current_kernel_time takes 120 cycles
* sched_clock returns ns, takes 70 cycles
* jiffies variable, plus HZ variable: HZ is 250
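* Example reads of the cheaper clocks above (illustrative):

      #include <linux/ktime.h>
      #include <linux/sched/clock.h>
      #include <linux/timex.h>

      static void clock_examples(void)
      {
              cycles_t c = get_cycles();  /* TSC cycles on x86; rate in cpu_khz. */
              u64 ns = sched_clock();     /* Cheap nanoseconds, loosely synced. */
              u64 kt = ktime_get_ns();    /* Fully corrected nanoseconds. */

              (void)c;
              (void)ns;
              (void)kt;
      }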
* What is the purpose of skbuff clones? It appears that cloning is
recommended for transmitting a packet while retaining a copy for
retransmission.
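* The usual pattern (sketch, abbreviated error handling) keeps the
  original skb queued for retransmission and transmits a clone:

      #include <linux/errno.h>
      #include <linux/gfp.h>
      #include <linux/skbuff.h>
      #include <net/ip.h>

      static int xmit_keep_copy(struct sock *sk, struct sk_buff *skb,
                                struct flowi *fl)
      {
              struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

              if (!clone)
                      return -ENOMEM;
              /* The clone shares the payload; skb stays queued for resends. */
              return ip_queue_xmit(sk, clone, fl);
      }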
* If there is an error in ip_queue_xmit, does it free the packet?
* The answer appears to be "yes", and Homa contains code to check this
and log if not.
* Is there a better way to compute packet hashes than Homa's approach
in gro_complete?
* Notes on timers:
* hrtimers execute at irq level, not softirq
* Functions to tell what level is current: in_irq(), in_softirq(), in_task()
* Notes on managing network buffers:
* tcp_sendmsg_locked (tcp.c) invokes sk_stream_alloc_skb, which returns 0
if memory is running short. If this happens, it invokes sk_stream_wait_memory
* tcp_stream_memory_free: its result indicates if there's enough memory for
a stream to accept more data
* Receiving packets (tcp_v4_rcv -> tcp_v4_do_rcv -> tcp_rcv_state_process
in tcp_ipv4.c)
* There is a variable tcp_memory_allocated, but I can't find where it
is increased; unclear exactly what this variable means.
* There is a variable tcp_memory_pressure, plus functions
tcp_enter_memory_pressure and tcp_leave_memory_pressure. The variable
appears to be modified only by those 2 functions.
* Couldn't find any direct calls to tcp_enter_memory_pressure, but a
pointer is stored in the struct proto.
* That pointer is invoked from sk_stream_alloc_skb and
sk_enter_memory_pressure.
* sk_enter_memory_pressure is invoked from sk_page_frag_refill and
__sk_mem_raise_allocated.
* __sk_mem_raise_allocated is invoked from __sk_mem_schedule
* __sk_mem_schedule is invoked from sk_wmem_schedule and sk_rmem_schedule
* Miscellaneous information:
* For raw sockets: "man 7 raw"
* Per-cpu data structures: linux/percpu.h, percpu-defs.h
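* A minimal per-cpu counter using those headers (illustrative):

      #include <linux/percpu-defs.h>
      #include <linux/percpu.h>
      #include <linux/types.h>

      static DEFINE_PER_CPU(u64, pkt_count);

      /* this_cpu_inc is safe against preemption and interrupts for counters. */
      static void count_packet(void)
      {
              this_cpu_inc(pkt_count);
      }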
* What happens when a socket is closed?
* socket.c:sock_close
* socket.c:sock_release
* proto_ops.release -> af_inet.c:inet_release)
* af_inet.c:inet_release doesn't appear to do anything relevant to Homa
* proto.close -> sock.c:sk_common_release?)
* proto.unhash
* sock_orphan
* sock_put (decrements ref count, frees)
* What happens in a sendmsg syscall (UDP)?
* socket.c:sys_sendmsg
* socket.c:__sys_sendmsg
* socket.c:___sys_sendmsg
* Copy to msghdr and control info to kernel space
* socket.c:sock_sendmsg
* socket.c:sock_sendmsg_nosec
proto_ops.sendmsg -> af_inet.c:inet_sendmsg
* Auto-bind socket, if not bound
* proto.sendmsg -> udp.c:udp_sendmsg
* Long method ...
* ip_output.c:ip_make_skb
* Seems to collect data for the datagram?
* __ip_append_data
* udp.c:udp_send_skb
* Creates UDP header
* ip_output.c:ip_send_skb
* ip_local_out
* Call stack down to driver for TCP sendmsg
tcp.c: tcp_sendmsg
tcp.c: tcp_sendmsg_locked
tcp_output.c: tcp_push
tcp_output.c: __tcp_push_pending_frames
tcp_output.c: tcp_write_xmit
tcp_output.c: __tcp_transmit_skb
ip_output.c: ip_queue_xmit
ip_output.c: ip_local_out
ip_output.c: __ip_local_out
ip_output.c: ip_output
ip_output.c: ip_finish_output
ip_output.c: ip_finish_output_gso
ip_output.c: ip_finish_output2
neighbour.h: neigh_output
neighbour.c: neigh_resolve_output
dev.c: dev_queue_xmit
dev.c: __dev_queue_xmit
dev.c: dev_hard_start_xmit
dev.c: xmit_one
netdevice.h: netdev_start_xmit
netdevice.h: __netdev_start_xmit
vlan_dev.c: vlan_dev_hard_start_xmit
dev.c: dev_queue_xmit
dev.c: __dev_queue_xmit
dev.c: __dev_xmit_skb
sch_generic.c: sch_direct_xmit
dev.c: dev_hard_start_xmit
dev.c: xmit_one
netdevice.h: netdev_start_xmit
netdevice.h: __netdev_start_xmit
en_tx.c: mlx5e_xmit
* Call stack for packet input handling (this is only approximate):
en_txrx.c: mlx5e_napi_poll
en_rx.c: mlx5e_poll_rx_cq
en_rx.c: mlx5e_handle_rx_cqe
dev.c: napi_gro_receive
dev.c: dev_gro_receive
??? protocol-specific handler
dev.c: napi_skb_finish
dev.c: napi_gro_complete
dev.c: netif_receive_skb_internal
dev.c: enqueue_to_backlog
.... switch to softirq core ....
dev.c: process_backlog
dev.c: __netif_receive_skb
dev.c: __netif_receive_skb_core
dev.c: deliver_skb
ip_input.c: ip_rcv
ip_input.c: ip_rcv_finish
ip_input.c: dst_input
homa_plumbing.c: homa_softirq
gcc -g -Wp,-MMD,/users/ouster/homaModule/.homa_offload.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -fmacro-prefix-map=./= -std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -mindirect-branch-cs-prefix -mfunction-return=thunk-extern -fno-jump-tables -fpatchable-function-entry=16,16 -fno-delete-null-pointer-checks -O2 -fno-allow-store-data-races -fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-stack-clash-protection -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -falign-functions=16 -fno-strict-overflow -fno-stack-check -fconserve-stack -Wall -Wundef -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs -Wno-frame-address -Wno-address-of-packed-member -Wmissing-declarations -Wmissing-prototypes -Wframe-larger-than=2048 -Wno-main -Wvla -Wno-pointer-sign -Wcast-function-type -Wno-stringop-overflow -Wno-array-bounds -Wno-alloc-size-larger-than -Wimplicit-fallthrough=5 -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -Wenum-conversion -Wextra -Wunused -Wno-unused-but-set-variable -Wno-unused-const-variable -Wno-packed-not-aligned -Wno-format-overflow -Wno-format-truncation -Wno-stringop-truncation -Wno-override-init -Wno-missing-field-initializers -Wno-type-limits -Wno-shift-negative-value -Wno-maybe-uninitialized -Wno-sign-compare -Wno-unused-parameter -g -gdwarf-4 -g -DMODULE -DKBUILD_BASENAME='"homa_offload"' -DKBUILD_MODNAME='"homa"' -D__KBUILD_MODNAME=kmod_homa -E -o /users/ouster/homaModule/homa_offload.e /users/ouster/homaModule/homa_offload.c ; ./tools/objtool/objtool --hacks=jump_label --hacks=noinstr --hacks=skylake --retpoline --rethunk --stackval --static-call --uaccess --prefix=16 --module /users/ouster/homaModule/homa_offload.o