This file contains various notes and lessons learned concerning performance
of the Homa Linux kernel module. The notes are in reverse chronological
order.
45. (January 2023) Up until now, output messages had to be completely copied
into sk_buffs before transmission could begin. Modified Homa to pipeline
the copy from user space with packet transmission. This makes a significant
difference in performance. For cp_node client --one-way --workload 500000
with MTU 1500, goodput increased from 11 Gbps (see #43 below) to 17-19
Gbps. For comparison, TCP is about 18.5 Gbps.
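A rough sketch of the pipelined send path (the helpers alloc_data_skb(),
copy_chunk_from_user(), and xmit_data_skb() are invented placeholders, not
Homa's real functions):
        static int send_message_pipelined(struct homa_rpc *rpc,
                        struct iov_iter *from, size_t length, size_t chunk)
        {
                size_t offset = 0;

                while (offset < length) {
                        size_t bytes = min(chunk, length - offset);
                        struct sk_buff *skb = alloc_data_skb(rpc, bytes);

                        if (!skb)
                                return -ENOMEM;
                        if (copy_chunk_from_user(skb, from, bytes))
                                return -EFAULT;
                        /* Hand this packet to the IP layer before copying
                         * the next chunk, so copying from user space
                         * overlaps with NIC transmission. */
                        xmit_data_skb(rpc, skb);
                        offset += bytes;
                }
                return 0;
        }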
44. (January 2023) Until now Homa has held an RPC's lock while transmitting
packets for that RPC. This isn't a problem if ip_queue_xmit returns
quickly. However, in some configurations (such as Intel xl170 NICs) the
driver is very slow, and if the NIC can't do TSO for Homa then the packets
passed to the NIC aren't very large. In these situations, Homa will be
transmitting packets almost 100% of the time for large messages, which
means the RPC lock will be held continuously. This locks out other
activities on the RPC, such as processing grants, which causes additional
performance problems. To fix this, Homa releases the RPC lock while
transmitting data packets (ip_queue_xmit or ip6_xmit). This helps a lot
with bad NICs, and even seems to help a little with good NICs (5-10%
increase in throughput for single-flow benchmarks).
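The essential locking pattern looks like this (field and helper names are
illustrative, not Homa's actual definitions; the real code must also recheck
the RPC's state after reacquiring the lock):
        struct sk_buff *skb;

        while ((skb = next_data_skb(rpc)) != NULL) {
                spin_unlock_bh(&rpc->lock);   /* don't hold lock during xmit */
                ip_queue_xmit(sk, skb, &fl);  /* sk, fl: socket, cached flow */
                spin_lock_bh(&rpc->lock);     /* reacquire before touching rpc */
                if (rpc->state == RPC_DEAD)   /* rpc may have changed meanwhile */
                        break;
        }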
43. (December 2022) 2-host throughput measurements (Gbps). Configuration:
* Single message: cp_node client --one-way --workload 500000
Server: one thread, pinned on a "good" core (avoid GRO/SoftIRQ conflicts)
* Multiple messages: client adds "--ports 2 --client-max 8"
Server doesn't pin, adds "--port-threads 2" (single port)
* All measurements used rtt_bytes=150000
                                        1.01    2.0 Buf + Short Bypass
   ---------------------------------------------------------------------
   Single message (MTU 1500)               9              11
   Single message (MTU 3000)           10-11              13
   Multiple messages (MTU 1500)        20-21           21-22
   Multiple messages (MTU 3000)        22-23           22-23
Conclusions:
* The new buffering mechanism helps single-message throughput about 20%,
but not much impact when there are many concurrent messages.
* Homa 1.01 seems to be able to hide most of the overhead of
page pool thrashing (#35 below).
42. (December 2022) New cluster measurements with "bench n10_mtu3000" (10
nodes, MTU 3000B) on the following configurations:
Jun 22: Previous measurements from June of 2022
1.01: Last commit before implementing new Homa-allocated buffers
2.0 Buf: Homa-allocated buffers
Grants: 2.0 Buf plus GRO_FAST_GRANTS (incoming grants processed
entirely during GRO)
Short Bypass: 2.0 Buf plus GRO_SHORT_BYPASS (all packets < 1400 bytes
processed entirely during GRO)
Short-message latencies in usecs (fastest short messages taken from
homa_w*.data files, W4NL data taken from unloaded_w4.data):
              Jun 22        1.01       2.0 Buf      Grants     Short Bypass
            P50   P99    P50   P99    P50   P99   P50   P99    P50   P99
           ----------   ----------   ----------  ----------   ----------
   W2      38.2   100   37.1  84.7   38.3  87.1  38.9  89.7   27.1  70.5
   W3      54.8   269   53.0   263   51.8   211  51.0   216   39.2   216
   W4      55.8   189   56.0   207   53.0   113  54.0   128   44.6   106
   W5      65.3   223   66.2   232   61.9   133  62.2   154   61.5   150
   W4NL    16.6  32.4   15.2  30.1   16.2  30.6  16.2  31.5   13.7  27.1
Best of 5 runs from "bench basic_n10_mtu3000":
                                         1.01   2.0 Buf   Grants   Short Bypass
   ------------------------------------------------------------------------
   Short-message RTT (usec)              16.1      16.1     16.1       13.5
   Single-message throughput (Gbps)      10.2      12.2     12.7       12.5
   Client RPC throughput (Mops/s)        1.46      1.51     1.52       1.75
   Server RPC throughput (Mops/s)        1.52      1.66     1.63       1.73
   Client throughput (Gbps)              23.6      23.7     23.6       23.7
   Server throughput (Gbps)              23.6      23.7     23.7       23.7
Conclusions:
* New buffering reduces tail latency >40% for W4 and W5 (perhaps by
eliminating all-at-once message copies that occupy cores for long
periods?). Latency improves by 20-30% (both at P50 and P99) for
all message lengths in W4.
* New buffering improves single-message throughput by 20% (25% when
combined with fast grants)
* Short bypass appears to be a win overall: a bit worse P99 for W5,
but better everywhere else and a significant improvement for short
messages at low load
41. (December 2022) More analysis of SMI interrupts. Wrote smi.cc to gather
data on events that cause all cores to stop simultaneously. Found 3 distinct
kinds of gaps on xl170 (Intel) CPUs:
* 2.5 usec gaps every 4 ms
* 17 usec gaps every 10 ms (however, these don't seem to be consistent:
they appear for a while at the start of each experiment, then stop)
* 170 usec gaps every 250 ms
I don't know for sure that these are all caused by SMI (e.g., could the
gaps every 4 ms be scheduler wakeups?)
40. (December 2022) NAPI can't process incoming jumbo frames at line rate
for a 100 Gbps network (AMD CPUs): it takes about 850 ns to process each
packet (median), but packets are arriving every 700 ns.
Most of the time is spent in __alloc_skb in two places:
kmalloc_reserve for data: 370 ns
prefetchw for last word of data: 140 ns
These times depend on core placements of threads; the above times
are for an "unfortunate" (but typical) placement; with an ideal placement,
the times drop to 100 ns for kmalloc_reserve and essentially 0 for the
prefetch.
Intel CPUs don't seem to have this problem: on the xl170 cluster, NAPI
processes 1500B packets in about 300 ns, and 9000B packets in about
450 ns.
39. (December 2022) One-way throughput for 1M messages varies from 18-27 Gbps
for Homa on the c6525-100g cluster, whereas TCP throughput is relatively
constant at 24 Gbps. Homa's variance comes from core placement: performance
is best if all of NAPI, GRO, and app are in the same group of 3 cores
(3N..3N+2) or their hypertwins. If they aren't, there are significant
cache miss costs as skbs get recycled from the app core back to the NAPI
core. TCP uses RFS to make sure that NAPI and GRO processing happen on
the same core as the application.
38. (December 2022) Restructured the receive buffer mechanism to mitigate
the page_pool_alloc_pages_slow problem (see August 2022 below); packets
can now be copied to user space and their buffers released without waiting
for the entire message to be received. This has a significant impact on
throughput. For "cp_node --one-way --client-max 4 --ports 1 --server-ports 1
--port-threads 8" on the c6525-100g cluster:
* Throughput increased from 21.5 Gbps to 42-45 Gbps
* Page allocations still happen with the new code, but they only consume
0.07 core now, vs. 0.6 core before
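The gist of the new receive path, as a hedged sketch (copy_to_user_buffer()
and the msgin packet queue are placeholders for whatever Homa actually uses):
        struct sk_buff *skb;
        int err = 0;

        /* Copy each packet into the user-space buffer region as soon as it
         * is available, then free the skb immediately so the driver's page
         * pool can recycle the page without waiting for the whole message. */
        while ((skb = skb_dequeue(&rpc->msgin.packets)) != NULL) {
                if (copy_to_user_buffer(rpc, skb))
                        err = -EFAULT;
                kfree_skb(skb);
        }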
37. (November 2022) Software GSO is very slow (17 usec on AMD EPYC processors,
breaking 64K into 9K jumbo frames). The main problem appears to be sk_buff
allocation, which takes multiple usecs because the packet buffers are too
large to be cached in the slab allocator.
36. (November 2022) Intel vs. AMD CPUs. Compared
"cp_node client --workload 500000" performance on c6525-100g cluster
(24-core AMD 7402P processors @ 2.8 GHz, 100 Gbps networking) vs. xl170
cluster (10-core Intel E5-2640v4 @ 2.4 GHz, 25 Gbps networking), priorities
not enabled on either cluster:
                                         Intel/25Gbps     AMD/100Gbps
   -----------------------------------------------------------------------
   Packet size                                  1500B           9000B
   Overall throughput (each direction)       3.4 Gbps    6.7-7.5 Gbps
   Stats from ttrpcs.py:
     Xmit/receive tput                         11 Gbps      30-50 Gbps
     Copy to/from user space                36-54 Gbps     30-110 Gbps
     RTT for first grant                      28-32 us        56-70 us
   Stats from ttpktdelay.py:
     SoftIRQ Wakeup (P50/P90)                  6/30 us        14/23 us
     Minimum network RTT                        5.5 us            8 us
     RTT with 100B messages                      17 us           28 us
35. (August 2022) Found problem with Mellanox driver that explains the
page_pool_alloc_pages_slow delays in the item below.
* The driver keeps a cache of "free" pages, organized as a FIFO
queue with a size limit.
* The page for a packet buffer gets added to the queue when the
packet is received, but with a nonzero reference count.
* The reference count is decremented when the skbuff is released.
* If the page gets to the front of the queue with a nonzero reference
count, it can't be allocated. Instead, a new page is allocated,
which is slower. Furthermore, this will result in excess pages,
eventually causing the queue to overflow; at that point, the excess
pages will be freed back to Linux, which is slow.
* Homa likes to keep large numbers of buffers around for
significant time periods; as a result, it triggers the slow path
frequently, especially for large messages.
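Paraphrasing the allocation logic described above as a sketch (this is not
the actual mlx5 code; the page_fifo type and helper names are invented):
        static struct page *get_rx_page(struct page_fifo *cache)
        {
                struct page *page = fifo_head(cache);

                if (page && !page_still_referenced(page)) {
                        fifo_pop(cache);
                        return page;       /* fast path: recycle from cache */
                }
                /* Slow path: the head page is still held (e.g. by Homa), so
                 * a fresh page must be allocated.  The extra pages eventually
                 * overflow the FIFO, at which point they are freed back to
                 * Linux, which is slower still. */
                return alloc_page(GFP_ATOMIC);
        }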
34. (August 2022) 2-node performance is problematic. Ran experiments with
the following client cp_node command:
cp_node client --ports 3 --server-ports 3 --client-max 10 --workload 500000
With max_window = rtt_bytes = 60000, throughput is only about 10 Gbps
on xl170 nodes. ttpktdelay output shows one-way times commonly 30us or
more, which means Homa can't keep enough grants outstanding for full
bandwidth. The overheads are spread across many places:
IP: IP stack, from calling ip_queue_xmit to NIC wakeup
Net: Additional time until homa_gro_receive gets packet
GRO Other: Time until end of GRO batch
GRO Gap: Delay after GRO packet processing until SoftIRQ handoff
Wakeup: Delay until homa_softirq starts
SoftIRQ: Time in homa_softirq until packet is processed
Total: End-to-end time from calling ip_queue_xmit to homa_softirq
handler for packet
Data packet lifetime (us), client -> server:
   Pctile     IP    Net  GRO Other  GRO Gap  Wakeup  SoftIRQ  Total
   0         0.5    4.6       0.0      0.2     1.0      0.1     7.3
   10        0.6   10.3       0.0      5.7     2.0      0.2    21.0
   30        0.7   12.4       0.4      6.3     2.1      1.9    27.0
   50        0.7   15.3       1.0      6.6     2.2      3.3    32.2
   70        0.8   18.2       2.0      8.1     2.3      3.8    45.3
   90        1.0   33.9       4.9     31.3     2.5      4.8    62.8
   99        1.4   56.5      20.7     48.5    17.7     17.5    85.6
   100      16.0   74.3      31.0     61.9    28.3     24.4   111.0
Grant lifetime (us), client -> server:
   Pctile     IP    Net  GRO Other  GRO Gap  Wakeup  SoftIRQ  Total
   0         1.7    2.6       0.0      0.3     1.0      0.0     7.6
   10        2.4    5.3       0.0      0.5     1.5      0.1    12.1
   30        2.5   10.3       0.0      6.1     2.1      0.1    23.3
   50        2.6   12.7       0.5      6.5     2.2      0.2    28.1
   70        2.8   16.5       1.1      7.2     2.3      0.3    38.1
   90        3.4   31.7       3.5     22.6     2.5      3.1    56.2
   99        4.6   54.1      17.7     48.4    17.5      4.3    78.5
   100      54.9   67.5      28.4     61.9    28.3     21.9    98.3
Additional client-side statistics:
Pre NAPI: usecs from interrupt entry to NAPI handler
GRO Total: usecs from NAPI handler entry to last homa_gro_receive
Batch: number of packets processed in one interrupt
Gap: usecs from last homa_gro_receive call to SoftIRQ handoff
   Pctile   Pre NAPI    GRO   Batch    Gap
   0             0.7    0.4       0    0.2
   10            0.7    0.6       0    0.3
   30            0.8    0.7       1    0.4
   50            0.8    1.5       2    6.6
   70            1.0    2.6       3    7.0
   90            2.7    4.9       4    7.5
   99            6.4    8.0       7   34.2
   100          21.7   23.9      12   48.2
In looking over samples of long delays, there are two common issues that
affect multiple metrics:
* page_pool_alloc_pages_slow; affects:
P90/99 Net, P90/99 GRO Gap, P99 SoftIRQ wakeup
* unidentified 14-17 us gaps in homa_xmit_data, homa_gro_receive,
homa_data_pkt, and other places:
affects P99 GRO Other, P99 SoftIRQ, P99 GRO
In addition, I found the following smaller problems:
* unknown gaps before homa_gro_complete of 20-30 us, affects:
P90 SoftIRQ wakeup
Is this related to the "unidentified 14-17 us gaps" above?
* net_rx_action sometimes slow to start; affects:
Wakeup
* large batch size affects:
P90 SoftIRQ
33. (June 2022) Short-message timelines (xl170 cluster, "cp_node client
--workload 100 --port-receivers 0"). All times are ns (data excludes
client-side recv->send turnaround time). Most of the difference
seems to be in kernel call time and NIC->NIC time. Also, note that
the 5.4.80 times have improved considerably from January 2021; there
appears to be at least 1 us variation in RTT from machine to machine.
                                    5.17.7              5.4.80
                               Server   Client     Server   Client
   ----------------------------------------------------------------
   Send:
     homa_send/reply              461      588        468      534
     IP/Driver                    514      548        508      522
     Total                        975     1136       1475     1056
   Receive:
     Interrupt->Homa GRO          923     1003        789      815
     GRO                          200      227        193      201
     Wakeup SoftIRQ               601      480        355      347
     IP SoftIRQ                   361      441        400      361
     Homa SoftIRQ                 702      469        588      388
     Wakeup App                    94      106         87       53
     homa_recv                    447      562        441      588
     Total                       3328     3288       2853     2753
   Recv -> send kcall                     682                 220
   NIC->NIC (round-trip)                 6361                5261
   RTT Total                            15770               13618
32. (January 2021) Best-case short-message timelines (xl170 cluster).
Linux 4.15.18 numbers were measured in September 2020. All times are ns.
                                5.4.80           4.15.18    Ratio
                           Server   Client
   ---------------------------------------------------------
   Send:
     System call              360      360         240      1.50
     homa_send/reply          620      870         420      1.77
     IP/Driver                495      480         420      1.16
     Total                   1475     1710        1080      1.47
   Receive:
     Interrupt->NAPI          560      500         530      1.00
     NAPI                     560      675         420      1.47
     Wakeup SoftIRQ           480      470         360      1.32
     IP SoftIRQ               305      335         320      1.00
     Homa SoftIRQ             455      190         240      1.34
     Wakeup App                80      100         270      0.33
     homa_recv                420      450         300      1.45
     System Call              360      360         240      1.50
     Total                   3220     3080        2680      1.18
   NIC->NIC (1-way)          2805     2805        2540      1.10
   RTT Total                15100    15100       12600      1.20
31. (January 2021) Small-message latencies (usec) for different workloads and
protocols (xl170 cluster, 40 nodes, high load, MTU 3000, Linux 5.4.80):
                       W2              W3              W4              W5
   Homa   P50        30.9            41.9            46.8            55.4
          P99        57.7            98.5           109.3           139.0
   DCTCP  P50       106.7 (3.5x)    160.4 (3.8x)    159.1 (3.4x)    151.8 (2.7x)
          P99      4812.1 (83x)    6361.7 (65x)     881.1 (8.1x)    991.2 (7.1x)
   TCP    P50       108.8 (3.5x)    192.7 (4.6x)    353.1 (7.5x)    385.7 (6.9x)
          P99      4151.5 (72x)    5092.7 (52x)    2113.1 (19x)    4360.7 (31x)
30. (January 2021) Analyzed effects of various configuration parameters,
running on 40-node xl170 cluster with MTU 3000:
duty_cycle: Reducing to 40% improves small-message latency 25% in W4 and
40% in W5
fifo_fraction: No impact on small message P99 except W3 (10% degradation);
previous measurements showed 2x improvement in P99 for
largest messages with modified W4 workload.
gro_policy: NORMAL always better; others 10-25% worse for short P99
max_gro_skbs: Larger is better; reducing to 5 hurts short P99 10-15%.
However, anecdotal experience suggests that very large
values can cause long delays for things like sending
grants, so perhaps 10 is best?
max_gso_size: 10K looks best; not much difference above that, 10-20%
degradation of short P99 at 5K
nic_queue_ns: 5-10x degradation in short P99 when there is no limit;
no clear winner for short P99 in 1-10 us range; however,
shorter is better for P50 (1us slightly better than 2us)
poll_usecs: 0-50us all equal for W4 and W5; 50us better for W2 and W3
(10-20% better short P99 than 0us).
ports: Not much sensitivity: 3 server and 3 client looks good.
client threads: Need 3 ports: W2 can't keep up with 1-2 ports, W3 can't
keep up with 1 port. With 3 ports, 2 receivers has 1.5-2x
lower short P99 for W2 and W3 than 4 receivers, but for
W5 3 receivers is 10% better than 2. Best choice: 3p2r?
rtt_bytes: 60K is best, but not much sensitivity: 40K is < 10% worse
throttle_bytes: Almost no noticeable difference from 100-2000; perhaps
500 or 1000?
29. (October 2020) Polling performance impact. In isolation, polling saves
about 4 us RTT per RPC. In the workloads, it reduces short-message P50
up to 10 us, and P99 up to 25us (the impact is greater with light-tailed
workloads like W1 and W2). For W2, polling also improved throughput
by about 3%.
28. (October 2020) Polling problem: some workloads (like W5 with 30 MB
messages) need a lot of receiving threads for occasional bursts where
several threads are tied up receiving very large messages. However,
this same number of receivers results in poor performance in W3,
because these additional threads spend a lot of time polling, which
wastes enough CPU time to impact the threads that actually have
work to do. One possibility: limit the number of polling threads per
socket? Right now it appears hard to configure polling for all
workloads.
27. (October 2020) Experimented with new GRO policy HOMA_GRO_NO_TASKS,
which attempts to avoid cores with active threads when picking cores
for SoftIRQ processing. This made almost no visible difference in
performance, and also depends on modifying the Linux kernel to
export a previously unexported function, so I removed it. It's
still available in repo commits, though.
26. (October 2020) Receive queue order. Experimented with ordering the
hsk->ready_requests and hsk->ready_responses lists to return short
messages first. Not clear that this provided any major benefits, and
it reduced throughput in some cases because of overheads in inserting
ready messages into the queues.
25. (October 2020) NIC queue estimation. Experimented with how much to
underestimate network bandwidth. Answer: not much! The existing 5% margin
of safety leaves bandwidth on the table, which impacts tail latency for
large messages. Reduced it to 1%, which helps large messages a lot (up to
2x reduction in latency). Impact on small messages is mixed (more get worse
than better), but the impact isn't large in either case.
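As a rough illustration of what the margin costs (using the xl170's 25 Gbps
link speed purely for arithmetic): a 5% margin paces the uplink at about
0.95 * 25 = 23.75 Gbps, giving up roughly 1.25 Gbps, whereas a 1% margin
paces at about 24.75 Gbps, giving up only about 0.25 Gbps.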
24. (July 2020) P10 under load. Although Homa can provide 13.5 us RTTs under
best-case conditions, this almost never occurs in practice. Even at low
loads, the "best case" (P10) is more like 25-30 us. I analyzed a bunch
of 25-30 us message traces and found the following sources of additional
delay:
* Network delays (from passing packet to NIC until interrupt received)
account for 5-10 us of the additional delay (most likely packet queuing
in the NIC). There could also be delays in running the interrupt handler.
* Every stage of software runs slower, typically taking about 2x as long
(7.1 us becomes 12-23 us in my samples, with median 14.6 us)
* Occasional other glitches, such as having to wake up a receiving
user thread, or interference due to NAPI/SoftIRQ processing of other
messages.
23. (July 2020) Adaptive polling. A longer polling interval (e.g. 500 usecs)
lowers tail latency for heavy-tailed workloads such as W4, but it hurts
other workloads (P999 tail latency gets much worse for W1 because polling
threads create contention for cores; P99 tail latency for large messages
suffers in W3). I attempted an adaptive approach to polling, where a thread
stops polling if it is no longer first in line, and gets woken up later to
resume polling if it becomes first in line again. The hope was that this
would allow a longer polling interval without negatively impacting other
workloads. It did help, but only a bit, and it added a lot of complexity,
so I removed it.
22. (July 2020) Best-case timetraces for short messages on xl170 CloudLab cluster.
Clients:                                                              Cum.
Event                                                               Median
--------------------------------------------------------------------------
[C?] homa_ioc_send starting, target ?:?, id ?, pid ? 0
[C?] mlx nic notified 939
[C?] Entering IRQ 9589
[C?] homa_gro_receive got packet from ? id ?, offset ?, priority ? 10491
[C?] enqueue_to_backlog complete, cpu ?, id ?, peer ? 10644
[C?] homa_softirq: first packet from ?:?, id ?, type ? 11300
[C?] incoming data packet, id ?, peer ?, offset ?/? 11416
[C?] homa_rpc_ready handed off id ? 11560
[C?] received message while polling, id ? 11811
[C?] Freeing rpc id ?, socket ?, dead_skbs ? 11864
[C?] homa_ioc_recv finished, id ?, peer ?, length ?, pid ? 11987
Servers:                                                              Cum.
Event                                                               Median
--------------------------------------------------------------------------
[C?] Entering IRQ 0
[C?] homa_gro_receive got packet from ? id ?, offset ?, priority ? 762
[C?] homa_softirq: first packet from ?:?, id ?, type ? 1566
[C?] incoming data packet, id ?, peer ?, offset ?/? 1767
[C?] homa_rpc_ready handed off id ? 2012
[C?] received message while polling, id ? 2071
[C?] homa_ioc_recv finished, id ?, peer ?, length ?, pid ? 2459
[C?] homa_ioc_reply starting, id ?, port ?, pid ? 2940
[C?] mlx nic notified 3685
21. (July 2020) SMI impact on tail latency. I observed gaps of 200-300 us where
a core appears to be doing nothing. These occur in a variety of places
in the code including in the middle of straight-line code or just
before an interrupt occurs. Furthermore, when these happen, *every* core
in the processor appears to stop at the same time (different cores are in
different places). The gaps do not appear to be related to interrupts (I
instrumented every __irq_entry in the Linux kernel sources), context
switches, or c-states (which I disabled). It appears that the gaps are
caused by System Management Interrupts (SMIs); they appear to account
for about half of the P99 traces I examined in W4.
20. (July 2020) RSS configuration. Noticed that tail latency most often occurs
because too much work is being done by either NAPI or SoftIRQ (or both) on
a single core, which prevents application threads on that core from running.
Tried several alternative approaches to RSS to see if better load balancing
is possible, such as:
* Concentrate NAPI and SoftIRQ packet handling on a small number of cores,
and use core affinity to keep application threads off of those cores.
* Identify an unloaded core for SoftIRQ processing and steer packet batches
to these carefully chosen cores (attempted several different policies).
* Bypass the entire Linux networking stack and call homa_softirq directly
from homa_gro_receive.
* Arrange for SoftIRQ to run on the same core as NAPI (this is more efficient
because it avoids inter-processor interrupts, but can increase contention
on that core).
Most of these attempts made things worse, and none produced dramatic
benefits. In the end, I settled on the following hybrid approach:
* For single-packet batches (meaning the NAPI core is underloaded), process
SoftIRQ on the same core as NAPI. This reduces small-message RTT by about
3 us in underloaded systems.
* When there are packet batches, examine several adjacent cores, and pick
the one for SoftIRQ that has had the least recent NAPI/SoftIRQ work.
Overall, this results in a 20-35% improvement in P99 latency for small
messages under heavy-tailed workloads, in comparison to the Linux default
RSS behavior.
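A hedged sketch of the hybrid policy (the candidate count and the
recent_work() busyness metric are illustrative, not Homa's actual code):
        static int choose_softirq_core(int napi_core, int batch_size)
        {
                int i, best = napi_core;
                u64 least = recent_work(napi_core);

                if (batch_size == 1)
                        return napi_core;   /* underloaded: stay local, no IPI */
                for (i = 1; i <= NUM_CANDIDATES; i++) {
                        int core = (napi_core + i) % nr_cpu_ids;
                        if (recent_work(core) < least) {
                                least = recent_work(core);
                                best = core;
                        }
                }
                return best;     /* steer SoftIRQ to the least-loaded candidate */
        }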
19. (July 2020) P999 latency for small messages. This is 5 ms or more in most
of the workloads, and it turns out to be caused by Linux SoftIRQ handling.
If __do_softirq thinks it is taking too much time, it stops processing
all softirqs in the high-priority NAPI thread, and instead defers them
to another thread, ksoftirqd, which intentionally runs at a low priority
so as not to interfere with user threads. This sometimes means softirq
processing must wait a full time slice while other threads run, which seems
to be 5-7 ms.
I tried disabling this feature of __do_softirq, so that all requests get
processed in the high-priority thread, and the P999 latency improved by
about 10x (< 1 ms worst case).
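The relevant deferral logic in __do_softirq (kernel/softirq.c) looks roughly
like this, from memory; consult the kernel source for the authoritative
version:
        pending = local_softirq_pending();
        if (pending) {
                if (time_before(jiffies, end) && !need_resched() &&
                    --max_restart)
                        goto restart;      /* keep processing inline */
                wakeup_softirqd();         /* defer to low-priority ksoftirqd */
        }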
18. (July 2020) Small-message latency. The best-case RTT for small messages
is very difficult to achieve under any real-world conditions. As soon as
there is any load whatsoever, best-case latency jumps from 15 us to 25-40 us
(depending on overall load). The latency CDF for Homa is almost completely
unaffected by load (whereas it varies dramatically with TCP).
17. (July 2020) Small-request optimization: if NAPI and SoftIRQ for a packet
are both done on the same core, it reduces round-trip latency by about
2 us for short messages; however, this works against the optimization below
for spreading out the load. I tried implementing it only for packets that
don't get merged for GRO, but it didn't make a noticeable difference (see
note above about best-case latency for short messages).
16. (June-July 2020) Analyzing tail latency. P99 latency under W4 seems to
occur primarily because of core congestion: a core becomes completely
consumed with either NAPI or SoftIRQ processing (or both) for a long
message, which keeps it from processing a short message. For example,
the user thread that handles the message might be on the congested core,
and hence doesn't run for a long time while the core does NAPI/SoftIRQ
work. I modified Homa's GRO code to pick the SoftIRQ core for each batch
of packets intelligently (choose a core that doesn't appear to be busy
with either NAPI or SoftIRQ processing), and this helped a bit, but not
a lot (10-20% reduction in P99 for W4). Even with clever assignment of
SoftIRQ processing, the load from NAPI can be enough to monopolize a core.
15. (June 2020) Cost of interrupt handler for receiving packets:
mlx5e_mpwqe_fill_rx_skb: 200 ns
napi_gro_receive: 150 ns
14. (June 2020) Does instrumentation slow Homa down significantly? Modified
to run without timetraces and without any metrics except essential ones
for computing priorities:
Latency dropped from 15.3 us to 15.1 us
Small-RPC throughput increased from 1.8 Mops/sec to 1.9 Mops/sec
Large-message throughput didn't change: still about 2.7 MB/sec
Disabling timetraces while retaining metrics roughly splits the
difference. Conclusion: not worth the effort of disabling metrics,
probably not worth turning off timetracing.
13. (June 2020) Implemented busy-waiting, where Homa spins for 2 RTTs
before putting a receiving thread to sleep. This reduced 100B RTT
on the xl170 cluster from 17.8 us to 15.3 us.
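A minimal sketch of the receive-side busy-wait (poll_cycles, ready_rpc(), and
wait_for_rpc_sleeping() are placeholders; the real code also handles signals
and multiple waiting threads):
        static struct homa_rpc *wait_for_message(struct homa_sock *hsk)
        {
                u64 end = get_cycles() + poll_cycles;     /* roughly 2 RTTs */

                while (get_cycles() < end) {
                        struct homa_rpc *rpc = ready_rpc(hsk);
                        if (rpc)
                                return rpc;
                        cpu_relax();
                }
                return wait_for_rpc_sleeping(hsk);   /* fall back to blocking */
        }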
12. (May 2020) Noticed that cores can disappear for 10-12ms, during which
softirq handlers do not get invoked. Homa timetraces show no activity
of any kind during that time (e.g., no interrupts either?). Found out
later that this is Homa's fault: there is no preemption when executing
in the kernel, and RPC reaping could potentially run on for a very long
time if it gets behind. Fixed this by adding calls to schedule() so that
SoftIRQ tasks can run.
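The shape of the fix, as a sketch (reap_dead_rpcs() and BATCH_SIZE are
placeholders for Homa's reaping code; cond_resched() is the idiomatic form of
the need_resched()/schedule() pair):
        while (reap_dead_rpcs(hsk, BATCH_SIZE) > 0) {
                /* Kernel code is not preempted, so yield explicitly between
                 * batches to let ksoftirqd and other tasks run. */
                cond_resched();
        }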
11. (Mar. 2020) For the slowdown tests, the --port-max value needs to be
pretty high to get true Poisson behavior. It was originally 20, but
increasing it had significant impact on performance for TCP, particularly
for short-message workloads. For example, TCP P99 slowdown for W1 increased
from 15 to 170x when --port-max increase from 20-100. Performance
got even worse at --port-max=200, but I decided to stick with 100 for now.
10. (Mar. 2020) Having multiple threads receiving on a single port makes a
big difference in tail latency. cperf had been using just one receiver
thread for each port (on both clients and servers); changing to
multiple threads reduced P50/P99 slowdown for small messages in W5
from 7/65 to 2.5/7.5!
9. Performance suffers from a variety of load balancing problems. Here
are some examples:
* (March 2020) Throughput varies by 20% from run to run when a single client
sends 500KB messages to a single server. In this configuration, all
packets arrive through a single NAPI core, which is fully utilized.
However, if Linux also happens to place other threads on that core (such
as the pacer) it takes time away from NAPI, which reduces throughput.
* (March 2020) When analyzing tail latency for small messages in W5, I found
that user threads are occasionally delayed 100s of microseconds in waking
up to handle a message. The problem is that both the NAPI and SoftIRQ
threads happened (randomly) to get busy on that core at the same time,
and they completely monopolized the core.
* (March 2020) Linux switches threads between cores very frequently when
threads sleep (2/3 of the time in experiments today).
8. (Feb. 2020) The pacer can potentially be a severe performance bottleneck
(a single thread cannot keep the network utilized with packets that are
not huge). In a test with 2 clients bombarding a single server with
1000-byte packets, performance started off high but then suddenly dropped
by 10x. There were two contributing factors. First, once the pacer got
involved, all transmissions had to go through the pacer, and the pacer
became the bottleneck. Second, this resulted in growth of the throttle
queue (essentially all standing requests: > 300 entries in this experiment).
Since the queue is scanned from highest to lowest priority, every insertion
had to scan the entire queue, which took about 6 us. At this point the queue
lock became the bottleneck, resulting in a 10x drop in performance.
I tried inserting RPCs from the other end of the throttle queue, but
this still left a 2x reduction in throughput because the pacer couldn't
keep up. In addition, it seems like there could potentially be situations
where inserting from the other end results in long searches. So, I backed
this out.
The solution was to allow threads other than the pacer to transmit packets
even if there are entries on the throttle queue, as long as the NIC queue
isn't long. This allows other threads besides the pacer to transmit
packets if the pacer can't keep up. In order to avoid pacer starvation,
the pacer uses a modified approach: if the NIC queue is too full for it to
transmit a packet immediately, it computes the time when it expects the
NIC queue to get below threshold, waits until that time arrives, and
then transmits; it doesn't check again to see if the NIC queue is
actually below threshold (which it may not be if other threads have
also been transmitting). This guarantees that the pacer will make progress.
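The resulting policy, sketched with invented names (not Homa's actual
functions or fields):
        /* Any thread with a packet to transmit: */
        if (nic_queue_ns_estimate() < max_nic_queue_ns || throttle_queue_empty())
                xmit_now(skb);                  /* bypass the pacer */
        else
                add_to_throttle_queue(rpc);

        /* The pacer thread: */
        while ((rpc = highest_priority_throttled_rpc()) != NULL) {
                wait_until(time_nic_queue_below_threshold());
                /* Transmit without rechecking the queue length, so the pacer
                 * always makes progress even if other threads also sent. */
                xmit_now(next_packet(rpc));
        }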
7. The socket lock is a throughput bottleneck when a multi-threaded server
is receiving large numbers of small requests. One problem was that the
lock was being acquired twice while processing a single-packet incoming
request: once during RPC initialization to add the RPC to active_rpcs,
and again later to dispatch the RPC to a server thread. Restructured
the code to do both of these with a single lock acquisition. Also
cleaned up homa_wait_for_message to reduce the number of times it
acquires socket locks. This produced the following improvements, measured
with one server (--port_threads 8) and 3 clients (--workload 100 --alt_client
--client_threads 20):
* Throughput increased from 650 kops/sec to 760
* socket_lock_miss_cycles dropped from 318% to 193%
* server_lock_miss_cycles dropped from 1.4% to 0.7%
6. Impact of load balancing on latency (xl170, 100B RPCs, 11/2019):
                     1 server thread   18 threads   TCP, 1 thread   TCP, 18 threads
   No RPS/RFS              16.0 us       16.3 us         20.0 us          25.5 us
   RPS/RFS enabled         17.1 us       21.5 us         21.9 us          26.5 us
5. It's better to queue a thread waiting for incoming messages at the *front*
of the list in homa_wait_for_message, rather than the rear. If there is a
pool of server threads but not enough load to keep them all busy, it's
better to reuse a few threads rather than spreading work across all of
them; this produces better cache locality. This approach improves latency
by 200-500ns at low loads.
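In terms of the kernel list API, the change is essentially this (the interest
structure and list name are placeholders for Homa's actual fields):
        /* LIFO: the most recently active server thread is woken first,
         * which keeps a small set of threads hot in the cache. */
        list_add(&interest->links, &hsk->request_interests);

        /* rather than FIFO, which rotates through every waiting thread: */
        /* list_add_tail(&interest->links, &hsk->request_interests); */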
4. Problem: large messages have no pipelining. For example, copying bytes
from user space to output buffers is not overlapped with sending packets,
and copying bytes from buffers to user space doesn't start until the
entire message has been received.
* Tried overlapping packet transmission with packet creation (7/2019) but
this made performance worse, not better. Not sure why.
3. It is hard for the pacer to keep the uplink fully utilized, because it
gets descheduled for long periods of time.
* Tried disabling interrupts while the pacer is running, but this doesn't
work: if a packet gets sent with interrupts disabled, the interrupts get
reenabled someplace along the way, which can lead to deadlock. Also,
the VLAN driver uses "interrupts off" as a signal that it should enter
polling mode, which doesn't work.
* Tried calling homa_pacer_xmit from multiple places; this helps a bit
(5-10%).
* Tried making the pacer thread a high-priority real-time thread; this
actually made things a bit worse.
2. There can be a long lag in sending grants. One problem is that Linux
tries to collect large numbers of buffers before invoking the softirq
handler; this causes grants to be delayed. Implemented max_gro_skbs to
limit buffering. However, varying the parameter doesn't seem to affect
throughput (11/13/2019).
1. Without RPS enabled, Homa performance is limited by a single core handling
all softirq actions. In order for RPS to work well, Homa must implement
its own hash function for mapping packets to cores (the default IP hasher
doesn't know about Homa ports, so it considers only the peer IP address).
However, with RPS, packets can get spread out over too many cores, which
causes poor latency when there is a single client and the server is
underloaded.
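A hedged sketch of a Homa-aware flow hash for RPS (how the address and ports
are extracted from the packet is omitted; jhash_3words and skb_set_hash are
standard kernel facilities):
        #include <linux/jhash.h>
        #include <linux/skbuff.h>

        static void homa_set_rps_hash(struct sk_buff *skb, __be32 saddr,
                        __u16 sport, __u16 dport)
        {
                /* Mix the peer address with both port numbers so different
                 * Homa flows from one peer land on different cores. */
                u32 hash = jhash_3words((__force u32)saddr,
                                ((u32)sport << 16) | dport, 0, 0);
                skb_set_hash(skb, hash, PKT_HASH_TYPE_L4);
        }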