Notes for Homa implementation in Linux:
---------------------------------------
* IPv6 issues:
* See if error checking made syscalls slower.
* Refactor of granting mechanism:
* Eliminate grant_increment: change to fifo_grant_increment instead
* grant_non_fifo may need to grant to a message that is also receiving
regular grants
* What if a message receives data beyond incoming, which completes the
message?
* Pinning memory: see mm.h and mm/gup.c
* get_user_page
* get_user_pages
* pin_user_page (not sure how it differs from get_user_page)
* Performance-related tasks:
* Improve software GSO by making segments refer to the initial large
buffer instead of copying?
* Rework granting to
* Implement sk_buff caching for output buffers:
* Allocation is slow (2-10 us on AMD processors; check on Intel?)
* Large buffers exceed KMALLOC_MAX_CACHE_SIZE, so they aren't cached
in slabs
* Keep free lists in Homa for different sizes (e.g. pre-GSO and GSO),
append output buffers there
* Can recycle an sk_buff by calling build_skb_around().
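  * A minimal sketch of the free-list idea (homa_skb_cache_get is a
    hypothetical helper, not the module's actual API; assumes a per-size
    struct sk_buff_head serves as the cache):
        struct sk_buff *homa_skb_cache_get(struct sk_buff_head *cache,
                void *data, unsigned int frag_size)
        {
                struct sk_buff *skb = skb_dequeue(cache);

                /* Cache empty: wrap the data buffer in a fresh skb. */
                if (!skb)
                        return build_skb(data, frag_size);

                /* Reinitialize the recycled skb header around the new
                 * data buffer, skipping the slab allocation entirely. */
                return build_skb_around(skb, data, frag_size);
        }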
* Rework FIFO granting so that it doesn't consider homa->max_overcommit
(just find the oldest message that doesn't have a pity grant)? Also,
it doesn't look like homa_grant_fifo is keeping track of pity grants
precisely; perhaps add another RPC field for this?
* Re-implement the duty-cycle mechanism. Use a generalized pacer to
control grants:
* Parameters:
* Allowable throughput
* Max accumulation of credits
* Methods:
* Request (current time, amount) (possibly 2 stages: isItOk and doIt?)
* Or, just reduce the link speed and let the pacer handle this?
  (see the token-bucket sketch below)
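* A token-bucket sketch of such a pacer (names and integer units are
  assumptions; u64 bytes_per_ns only works as an integer for fast links):
        struct pacer {
                u64 last_update;   /* time of last credit refill, ns */
                u64 credits;       /* accumulated allowance, bytes */
                u64 max_credits;   /* cap on credit accumulation */
                u64 bytes_per_ns;  /* allowable throughput */
        };

        /* Returns true if "amount" bytes may be granted/sent now. */
        static bool pacer_request(struct pacer *p, u64 now, u64 amount)
        {
                u64 new_credits = p->credits
                                + (now - p->last_update) * p->bytes_per_ns;
                p->credits = min(new_credits, p->max_credits);
                p->last_update = now;
                if (p->credits < amount)
                        return false;
                p->credits -= amount;
                return true;
        }
  The two-stage isItOk/doIt variant would simply split the credit check
  from the debit.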
* Analyze 40-us W4 short message latency by writing a time-trace
analyzer that tracks NIC queue length.
* Perhaps limit the number of polling threads per socket, to solve
the problems with having lots of receiver threads?
* Move some reaping to the pacer? It has time to spare
* Figure out why TCP W2 P99 gets worse with higher --client-max
* See if turning off c-states allows shorter polling intervals?
* Consider a permanent reduction in rtt_bytes.
* Consider reducing throttle_min_bytes to see if it helps region 1
in the CDF?
* Modify cp_node's TCP to use multiple connections per client-server pair
* Why is TCP beating Homa on cp_server_ports? Perhaps TCP servers are getting
>1 request per kernel call?
* Things to do:
* If a socket is closed with unacked RPCs, the peer will have to send a
long series of NEED_ACKS (which must be ignored because the socket is
gone) before finally reaping the RPCs. Perhaps have a "no such socket"
packet type?
* Reap most of a message before getting an ack? To do this, the receiver
  must include a "received offset" in grants; then the sender can free
  everything up to the latest received offset.
* Try more aggressive retries (e.g. if a missing packet is sufficiently
long ago, don't wait for timeout).
* Eliminate hot spots involving NAPI:
* Arrange for incoming bursts to be divided into batches where
alternate batches do their NAPI on 2 different cores.
* To do this, use TCP for Homa!
* Send Homa packets using TCP, and use different ports to force
different NAPI cores
* Interpose on the TCP packet reception hooks, and redirect
real TCP packets back to TCP.
* Consider replacing grantable list with a heap?
* Unimplemented interface functions.
* Learn about CONFIG_COMPAT and whether it needs to be supported in
struct proto and struct proto_ops.
* Learn about security stuff, and functions that need to be called for this.
* Learn about memory management for sk_buffs: how many is it OK to have?
* See tcp_out_of_memory.
* Eventually initialize homa.next_client_port to something random
* Define a standard mechanism for returning errors:
* Socket not supported on server (or server process ends while
processing request).
* Server timeout
* Is it safe to use non-locking skb queue functions?
* Is the RCU usage for sockets safe? In particular, how long is it safe
to use a homa_sock returned by homa_find_socket? Could it be deleted from
underneath us? This question may no longer be relevant, given the
implementation of homa_find_socket.
* Can a packet input handler be invoked multiple times concurrently?
* What is audit_sockaddr? Do I need to invoke it when I read sockaddrs
from user space?
* When a struct homa is destroyed, all of its sockets end up in an unsafe
state in terms of their socktab links.
* Clean up ports and ips in unit_homa_incoming.c
* Plug into Linux capability mechanism (man(7) capabilities)
* Don't return any errors on sends?
* Homa-RAMCloud doesn't retransmit bytes if it transmitted other bytes
recently; should HomaModule do the same? Otherwise, will retransmit
for requests whose service time is just about equal to the resend timer.
* Check tcp_transmit_skb to make sure we are doing everything we need to
do with skbuffs (e.g., update sk_wmem_alloc?)
* Add support for cgroups (e.g. to manage memory allocation)
* Questions for Linux experts:
* If an interrupt arrives after a thread has been woken up to receive an
incoming message, but before the kernel call returns, is it possible
for the kernel call to return EINTR, such that the message isn't received
and no one else has woken up to handle it?
* OK to call kmalloc at interrupt level?
Yes, but must specify GFP_ATOMIC as argument, not GFP_KERNEL; the operation
will not sleep, which means it could fail more easily.
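  For example (struct foo is a placeholder; the error path is mandatory,
  since atomic allocations fail more readily):
        struct foo *f = kmalloc(sizeof(*f), GFP_ATOMIC); /* softirq/irq */
        if (!f)
                goto drop;   /* must tolerate allocation failure */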
* Is it OK to retain struct dst_entry pointers for a long time? Can they
ever become obsolete (e.g. because routes change)? It looks like the
"obsolete" field will take care of this. However, a socket is used to
create a dst entry; what if that socket goes away?
* Can flows and struct_dst's be shared across sockets? What information
must be considered to make these things truly safe for sharing (e.g.
source network port?)?
* Source addresses for things like creating flows: can't just use a single
value for this host? Could be different values at different times?
* How to lock between user-level and bottom-half code?
* Must use a spin lock
* Must invoke spin_lock_bh and spin_unlock_bh, which disable bottom-half
  (softirq) processing on the local core as well as acquiring the lock
  (see the sketch below).
* What's the difference between bh_lock_sock and bh_lock_sock_nested?
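* Sketch of the pattern (hsk->lock is illustrative):
        /* Process-context side: the _bh variant blocks softirq
         * (bottom-half) execution on this core while the lock is
         * held, so the packet handler can't deadlock against us. */
        spin_lock_bh(&hsk->lock);
        /* ... touch state shared with bottom-half code ... */
        spin_unlock_bh(&hsk->lock);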
* Is there a platform-independent way to read a high-frequency clock?
* get_cycles appears to perform an RDTSC
* cpu_khz holds the clock frequency
* do_gettimeofday takes 750 cycles!
* current_kernel_time takes 120 cycles
* sched_clock returns ns, takes 70 cycles
* jiffies variable, plus HZ variable: HZ is 250
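* Sketch comparing the two cheap options above (cpu_khz is x86-specific):
        u64 t0 = sched_clock();        /* nanoseconds directly */
        cycles_t c0 = get_cycles();    /* raw TSC on x86 */
        /* ... code being measured ... */
        u64 ns = sched_clock() - t0;
        u64 tsc_ns = (u64)(get_cycles() - c0) * 1000000 / cpu_khz;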
* What is the purpose of skbuff clones? Appears that cloning is recommended
to transmit packet while retaining copy for retransmission.
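  The pattern (sketch): the clone shares the packet data; only the skb
  header struct is duplicated:
        struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);
        if (clone)
                ip_queue_xmit(sk, clone, &inet_sk(sk)->cork.fl);
        /* skb itself stays queued for possible retransmission */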
* If there is an error in ip_queue_xmit, does it free the packet?
* The answer appears to be "yes", and Homa contains code to check this
and log if not.
* How to compute the *real* number of CPUs (<< NR_CPUS?)
* Is there a better way to compute packet hashes than Homa's approach
in gro_complete?
* Notes on IP packet transmission and reception:
* ip_queue_xmit -> ip_local_out -> dst_output
* Ultimately, output is handled by skb_dst(skb)->output(net, sk, skb),
which probably is ip_output
* ip_output -> ip_finish_output -> ip_finish_output2 -> neigh_output?
* Incoming packets:
* Interrupt handlers pass packets to netif_rx
* It queues them in a per-CPU softnet_data structure
* RPS: Receive Packet Steering
* On the destination core, __netif_receive_skb_core is eventually invoked?
* ip_rcv eventually gets called to handle all incoming IP packets
* ip_local_deliver_finish finally calls Homa
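  * A transport enters this dispatch path by registering a struct
    net_protocol for its protocol number (sketch; IPPROTO_HOMA stands
    for whatever number the module claims):
        static struct net_protocol homa_protocol = {
                .handler = homa_softirq,
                .no_policy = 1,
        };
        /* at module load time: */
        status = inet_add_protocol(&homa_protocol, IPPROTO_HOMA);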
* Notes on skbuff usage:
* skb->destructor: invoked when skbuff is freed.
* sk->sk_wmem_alloc:
* Keeps track of memory in write buffers that are being transmitted.
* Prevents final socket cleanup
* Has an extra increment of 1, set when socket allocated
and removed in sk_free (so cleanup won't be done until socket
has been freed)
* sk->sk_write_space: invoked to signal that write space has become available
* skb->truesize: total amount of memory required by this skbuff, including
both the data block and the skbuff header.
* sock_wmalloc: allocates a new buffer for writing, limited to sk->sk_sndbuf
  and charged against sk->sk_wmem_alloc
* sk->sk_sndbuf: Maximum amount of write buffer space that this socket can
  consume
* sk->sk_wmem_queued: "persistent queue size" (perhaps buffers that are
queued but not yet ready to transmit?)
* sk->sk_rmem_alloc: appears to count space in read buffers, but it isn't
  updated automatically in the current Homa call structure.
* skb_set_owner_r, sock_rfree: assist in managing sk_rmem_alloc
* nr_free_buffer_pages: appears to return info about total available
memory space, for autosizing buffer usage?
* sysctl_wmem_default: default write buffer space per socket.
* net.ipv4.tcp_mem[0]: if memory usage is below this, no pressure
[1]: start applying memory pressure at this level
[2]: maximum allowed memory usage
* net.ipv4.tcp_wmem[0]: minimum sk_sndbuf for a socket
[1]: default sk_sndbuf
[2]: maximum allowable sk_sndbuf
* sk_memory_allocated_add, sk_memory_allocated_sub: keep track of memory
allocated for socket.
* Leads still to follow for skbuff usage:
* Read sock_def_write_space, track variables used to wait for write space,
see how these are used.
* What's the meaning of SOCK_USE_WRITE_QUEUE in sock_wfree?
* Check out sock_alloc_send_pskb
* Check out skb_head_from_pool: allocate faster from processor-specific pool?
* Check out sk_forward_alloc
* Check out tcp_under_memory_pressure
* Check out sk_mem_charge
* How buffer memory can accumulate in Homa:
* Incoming packets: messages not complete, or application doesn't read.
* Outgoing packets: receiver doesn't grant to us.
* Possible remedies for memory congestion:
* Delete incoming messages that aren't active
* Delete incoming messages that application is ignoring
* Delete outgoing messages that aren't getting grants
* Stop receiving data from incoming messages (discard packets, send BUSY)
* Don't accept outbound data: stall in write, or reject
* Notes on timers:
* hrtimers execute at irq level, not softirq
* Functions to tell what level is current: in_irq(), in_softirq(), in_task()
* Detailed switches from normal module builds:
gcc -Wp,-MD,/home/ouster/remote/homaModule/.homa_plumbing.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.9/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -D__KERNEL__ -DCONFIG_CC_STACKPROTECTOR -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -Werror-implicit-function-declaration -Wno-format-security -std=gnu89 -fno-PIE -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -DCONFIG_X86_X32_ABI -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_AVX512=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -DRETPOLINE -fno-delete-null-pointer-checks -O2 --param=allow-store-data-races=0 -DCC_HAVE_ASM_GOTO -Wframe-larger-than=2048 -fstack-protector -Wno-unused-but-set-variable -fno-var-tracking-assignments -g -pg -mfentry -DCC_USING_FENTRY -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=implicit-int -Werror=strict-prototypes -Werror=date-time -DMODULE -DKBUILD_BASENAME='"homa_plumbing"' -DKBUILD_MODNAME='"homa"' -c -o /home/ouster/remote/homaModule/.tmp_homa_plumbing.o /home/ouster/remote/homaModule/homa_plumbing.c
./tools/objtool/objtool orc generate --module --no-fp --retpoline "/home/ouster/remote/homaModule/.tmp_homa_plumbing.o"
* TCP socket close: socket_file_ops in socket.c (.release)
-> sock_close -> sock_release -> proto_ops.release
-> inet_release (af_inet.c) -> sk->sk_prot->close
-> tcp_close (tcp.c)
* How to pair requests and responses?
* Choice #1: extend addresses to include an RPC id:
* On client send, destination address has an id of 0; kernel fills in
correct id.
* On receive, the source address includes the RPC id (both client and server)
* On server send, destination address has a non-zero id (the one from
the receive): this is used to pair the response with a particular request.
Analysis:
* The RPC ID doesn't exactly fit as part of addresses, though it is close.
* Doesn't require a change in API.
* Can the kernel modify the address passed to sendmsg? What if the
application invokes write instead of sendmsg?
* Choice #2: perform sends and receives with an ioctl that can be used
to pass RPC ids.
Analysis:
* Results in what is effectively a new interface.
* Choice #3: put the RPC Id in the message at the beginning. The client
selects the id, not the kernel, but the kernel will interpret these
ids both on sends and receives.
Analysis:
* Awkward interaction between client and kernel, with the kernel
now interpreting what used to be just an uninterpreted blob of data.
* Will probably result in more application code to read and write
the ids; unclear that this can be hidden from app.
* Choice #4: define a new higher-level application API; it won't matter
what the underlying kernel calls are:
homa_send(fd, address, msg) -> id
homa_recv(fd, buffer) -> id, length, sender_address, is_request
homa_invoke(fd, address, request, response) -> response_length
homa_reply(fd, address, id, msg)
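  One possible C binding for these calls (argument types are guesses,
  sketched only to make the shape of the API concrete):
        ssize_t homa_send(int fd, const struct sockaddr *dest,
                socklen_t addrlen, const void *msg, size_t len,
                uint64_t *id);
        ssize_t homa_recv(int fd, void *buf, size_t len,
                struct sockaddr *src, socklen_t *addrlen,
                uint64_t *id, int *is_request);
        ssize_t homa_invoke(int fd, const struct sockaddr *dest,
                socklen_t addrlen, const void *request, size_t req_len,
                void *response, size_t resp_len);
        int homa_reply(int fd, const struct sockaddr *dest,
                socklen_t addrlen, uint64_t id, const void *msg,
                size_t len);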
* Notes on managing network buffers:
* tcp_sendmsg_locked (tcp.c) invokes sk_stream_alloc_skb, which returns 0
  if memory is running short. If this happens, it invokes sk_stream_wait_memory
* tcp_stream_memory_free: its result indicates if there's enough memory for
a stream to accept more data
* Receiving packets (tcp_v4_rcv -> tcp_v4_do_rcv -> tcp_rcv_state_process
in tcp_ipv4.c)
* There is a variable tcp_memory_allocated, but I can't find where it
is increased; unclear exactly what this variable means.
* There is a variable tcp_memory_pressure, plus functions
tcp_enter_memory_pressure and tcp_leave_memory_pressure. The variable
appears to be modified only by those 2 functions.
* Couldn't find any direct calls to tcp_enter_memory_pressure, but a
pointer is stored in the struct proto.
* That pointer is invoked from sk_stream_alloc_skb and
sk_enter_memory_pressure.
* sk_enter_memory_pressure is invoked from sk_page_frag_refill and
__sk_mem_raise_allocated.
* __sk_mem_raise_allocated is invoked from __sk_mem_schedule
* __sk_mem_schedule is invoked from sk_wmem_schedule and sk_rmem_schedule
* Waiting for input in TCP:
* tcp_recvmsg (tcp.c) -> sk_wait_data (sock.c)
* Waits for a packet to arrive in sk->sk_receive_queue (loops)
* tcp_v4_rcv (tcp_ipv4.c) -> tcp_v4_do_rcv
-> tcp_rcv_established (tcp_input.c) -> sk->sk_data_ready
-> sock_def_readable (sock.c)
* Wakes up sk->sk_wq
* Waiting for input in UDP:
* udp_recvmsg -> __skb_recv_udp -> __skb_wait_for_more_packets (datagram.c)
* Sleeps process with no loop
* udp_rcv -> __udp4_lib_rcv -> udp_queue_rcv_skb -> __udp_queue_rcv_skb
-> __udp_enqueue_schedule_skb -> sk->sk_data_ready
-> sock_def_readable (sock.c)
* Wakes up sk->sk_wq
* Notes on waiting:
* sk_data_ready function looks like it will do most of the work for waking
up a sleeping process. sock_def_readable is the default implementation.
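* Sketch of the same pattern for Homa (homa_message_ready is hypothetical):
        static void homa_message_ready(struct sock *sk)
        {
                /* ... queue the complete message for the app, then: */
                sk->sk_data_ready(sk);  /* default: sock_def_readable,
                                         * which wakes up sk->sk_wq */
        }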
* On send:
* Immediately copy message into sk_buffs.
* Client assigns message id; it's the first 8 bytes of the message data.
* Return before sending entire message.
* Homa keeps track of outstanding requests (some limit per socket?).
* If message fails, kernel must fabricate a response. Perhaps all
responses start with an id and a status?
* Tables needed:
* All Homa sockets
* Used to assign new port numbers
* Used to dispatch incoming packets
* Need RCU or some other kind of locking?
* Outgoing RPCs (for a socket?)
* Used to find state for incoming packets
* Used for cleanup operations (socket closure, cancellation, etc.)
* Used for detecting timeouts
* No locks needed: use existing socket lock
* Or, have one table for all sockets?
* Outgoing requests that haven't yet been transmitted:
* For scheduling outbound traffic
* Must be global?
* Outgoing responses that haven't yet been transmitted:
* For scheduling outbound traffic
* Must be global?
* Incoming RPCs:
* Use to find state for incoming packets
* Miscellaneous information:
* For raw sockets: "man 7 raw"
* Per-cpu data structures: linux/percpu.h, percpu-defs.h
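  Example of the per-cpu pattern (struct and names hypothetical):
        struct homa_core_stats {
                u64 packets_received;
        };
        DEFINE_PER_CPU(struct homa_core_stats, homa_core_stats);

        /* Lock-free on the fast path; each core updates its own copy: */
        this_cpu_inc(homa_core_stats.packets_received);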
* API for applications
* Ideally, sends are asynchronous:
* The send returns before the message has been sent
* Data has been copied out of application-level buffers, so
buffers can be reused
* Must associate requests and responses:
* A response is different from a request.
* Kernel may need to keep track of open requests, so that it
can handle RESEND packets appropriately; what if application
doesn't respond, and an infinite backlog of open requests
builds up? Must limit the kernel state that accumulates.
* Maybe application must be involved in RESENDs?
* On receive, application must provide space for largest possible message
* Or, receives must take 2 system calls, one to get the size and
one to get the message.
* Support a polling API for incoming messages?
* Client provides buffer space in advance
* Kernel fills in data as packets arrive
* Client can poll memory to see when new messages arrive
* This would minimize sk_buff usage in the kernel
* Is there a way for the kernel to access client memory when
the process isn't active?
* Can buffer space get fragmented? For example, the first part of
a long message arrives, but the rest doesn't; meanwhile, buffers
fill up and wrap around.
* On receive, avoid copies of large message bodies? E.g., deliver only
header to the application, then it can come back later and request
that the body be copied to a particular spot.
* Provide a batching mechanism to avoid a kernel call for each message?
* What happens when a socket is closed?
* socket.c:sock_close
* socket.c:sock_release
* proto_ops.release -> af_inet.c:inet_release
* af_inet.c:inet_release doesn't appear to do anything relevant to Homa
* proto.close -> sock.c:sk_common_release?
* proto.unhash
* sock_orphan
* sock_put (decrements ref count, frees)
* What happens in a connect syscall (UDP)?
* socket.c:sys_connect
* proto_ops.connect -> af_inet.c:inet_dgram_connect
* proto.connect -> datagram.c:ip4_datagram_connect
* datagram.c: __ip4_datagram_connect
* What happens in a bind syscall (UDP)?
* socket.c:sys_bind
* proto_ops.bind -> af_inet.c:inet_bind
* proto.bind -> (not defined for UDP)
* If no proto.bind handler is defined, then a bunch of obscure-looking
  stuff happens.
* What happens in a sendmsg syscall (UDP)?
* socket.c:sys_sendmsg
* socket.c:__sys_sendmsg
* socket.c:___sys_sendmsg
* Copy to msghdr and control info to kernel space
* socket.c:sock_sendmsg
* socket.c:sock_sendmsg_nosec
* proto_ops.sendmsg -> af_inet.c:inet_sendmsg
* Auto-bind socket, if not bound
* proto.sendmsg -> udp.c:udp_sendmsg
* Long method ...
* ip_output.c:ip_make_skb
* Seems to collect data for the datagram?
* __ip_append_data
* udp.c:udp_send_skb
* Creates UDP header
* ip_output.c:ip_send_skb
* ip_local_out
* Call stack down to driver for TCP sendmsg
tcp.c: tcp_sendmsg
tcp.c: tcp_sendmsg_locked
tcp_output.c: tcp_push
tcp_output.c: __tcp_push_pending_frames
tcp_output.c: tcp_write_xmit
tcp_output.c: __tcp_transmit_skb
ip_output.c: ip_queue_xmit
ip_output.c: ip_local_out
ip_output.c: __ip_local_out
ip_output.c: ip_output
ip_output.c: ip_finish_output
ip_output.c: ip_finish_output_gso
ip_output.c: ip_finish_output2
neighbour.h: neigh_output
neighbour.c: neigh_resolve_output
dev.c: dev_queue_xmit
dev.c: __dev_queue_xmit
dev.c: dev_hard_start_xmit
dev.c: xmit_one
netdevice.h: netdev_start_xmit
netdevice.h: __netdev_start_xmit
vlan_dev.c: vlan_dev_hard_start_xmit
dev.c: dev_queue_xmit
dev.c: __dev_queue_xmit
dev.c: __dev_xmit_skb
sch_generic.c: sch_direct_xmit
dev.c: dev_hard_start_xmit
dev.c: xmit_one
netdevice.h: netdev_start_xmit
netdevice.h: __netdev_start_xmit
en_tx.c: mlx5e_xmit
* Call stack for packet input handling (this is only approximate):
en_txrx.c: mlx5e_napi_poll
en_rx.c: mlx5e_poll_rx_cq
en_rx.c: mlx5e_handle_rx_cqe
dev.c: napi_gro_receive
dev.c: dev_gro_receive
??? protocol-specific handler
dev.c: napi_skb_finish
dev.c: napi_gro_complete
dev.c: netif_receive_skb_internal
dev.c: enqueue_to_backlog
.... switch to softirq core ....
dev.c: process_backlog
dev.c: __netif_receive_skb
dev.c: __netif_receive_skb_core
dev.c: deliver_skb
ip_input.c: ip_rcv
ip_input.c: ip_rcv_finish
ip_input.c: dst_input
homa_plumbing.c: homa_softirq