Allow txg_kick() even there is a txg still syncing #11929
Conversation
Current txg_kick() only works if there is no txg in either the quiescing or syncing stage. This commit allows txg_kick() to work as long as no txg is quiescing. This keeps the syncing stage busy, which should benefit write throughput. Also, txg_kick() should be triggered on dp_dirty_pertxg[] instead of dp_dirty_total, because dp_dirty_total won't be decreased until the dirty txg is synced. This change prevents pushing empty txgs through the pipeline back to back. Signed-off-by: jxdking <[email protected]>
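To make the described condition concrete, here is a minimal, self-contained C model of the kick decision; it is not the actual patch. The names dp_dirty_pertxg, dp_dirty_total, zfs_dirty_data_max and zfs_dirty_data_sync_percent mirror real ZFS fields and tunables, but the structure and the should_kick() helper are simplified stand-ins for illustration only.

#include <stdbool.h>
#include <stdint.h>

#define TXG_SIZE 4
#define TXG_IDX(txg) ((txg) & (TXG_SIZE - 1))

/* Simplified stand-ins for the real ZFS tunables. */
static uint64_t zfs_dirty_data_max = 256ULL << 20;      /* 256 MiB */
static uint64_t zfs_dirty_data_sync_percent = 20;

struct pool_model {
        uint64_t dp_dirty_pertxg[TXG_SIZE]; /* dirty bytes per txg */
        uint64_t dp_dirty_total;            /* only drops once a txg syncs */
        bool     txg_quiescing;             /* a txg is currently quiescing */
};

/*
 * Old behavior: kick only if nothing is quiescing or syncing, based on
 * dp_dirty_total.  New behavior (sketched here): kick as long as nothing
 * is quiescing, based on the open txg's own dirty counter, so empty txgs
 * are not pushed through the pipeline while dp_dirty_total is still
 * elevated by a txg that is being synced.
 */
bool
should_kick(const struct pool_model *p, uint64_t open_txg)
{
        uint64_t thresh =
            zfs_dirty_data_max * zfs_dirty_data_sync_percent / 100;

        return (!p->txg_quiescing &&
            p->dp_dirty_pertxg[TXG_IDX(open_txg)] >= thresh);
}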
FYI I've been running with this for a couple of days - I haven't benchmarked it but it's been behaving fine so far.
In principle this makes good sense to me to help smooth out the txgs.
The test below is based on my first commit. The conclusion is at the end of this post.

setup:
zpool create -o ashift=12 -O primarycache=metadata test /tmp/mytempfs/d1

Parameters:
echo $(expr 1024 \* 1024 \* 256) >> /sys/module/zfs/parameters/zfs_dirty_data_max
echo $(expr 1024 \* 1024 \* 1024) >> /sys/module/zfs/parameters/zfs_arc_min
echo 0 >> /sys/module/zfs/parameters/zfs_abd_scatter_enabled
echo 30 >> /sys/module/zfs/parameters/zfs_dirty_data_sync_percent

Note: zfs_dirty_data_sync_percent is set to 30%. I found there is a fixed overhead to sync a txg regardless of the txg size. Basically, you want to keep each txg as large as possible, so that fewer txgs are needed to get the job done. However, dirty data is soft-capped by zfs_delay_min_dirty_percent (60%), and you also want at least 2 txgs in the pipeline at the same time, one syncing and one open. Thus, 30% might be the optimal value.

.fio file:
before patch
WRITE: bw=2312MiB/s (2425MB/s), 62.5MiB/s-93.4MiB/s (65.5MB/s-97.9MB/s), io=67.8GiB (72.8GB), run=30001-30036msec
WRITE: bw=2364MiB/s (2479MB/s), 63.3MiB/s-100MiB/s (66.3MB/s-105MB/s), io=69.3GiB (74.5GB), run=30001-30037msec
WRITE: bw=2366MiB/s (2481MB/s), 62.9MiB/s-96.8MiB/s (65.9MB/s-102MB/s), io=69.4GiB (74.5GB), run=30001-30047msec
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
34469 44413228261394 C 81412096 0 80900096 0 680 38990172 90459 39504 31888244
34470 44413267251566 C 80756736 0 80183296 0 676 35474837 95056 15563 33482372
34471 44413302726403 C 81543168 0 81043456 0 678 36462139 6058754 20627 26053420
34472 44413339188542 C 81412096 0 80924672 0 679 35180551 5144063 16541 31480321
34473 44413374369093 C 80756736 0 80134144 0 662 41049444 297837 14129 32115820
34474 44413415418537 C 80887808 0 80412672 0 679 35711703 117254 17481 33287954
34475 44413451130240 C 80756736 45056 80269312 1 679 37546015 221847 20722 37446051
34476 44413488676255 C 81412096 0 80912384 0 677 41115482 4964932 44709 28192250
34477 44413529791737 C 81674240 0 81088512 0 681 36332028 114537 23074 31926800
34478 44413566123765 C 81412096 0 80842752 0 680 35750331 145100 39345 34063030
34479 44413601874096 C 81412096 0 80936960 0 686 37437573 124580 13599 34473612
34480 44413639311669 C 81018880 0 80384000 0 669 38736130 122841 34216 34415495
34481 44413678047799 C 81018880 0 80433152 0 673 38211708 219022 10173 32172248
34482 44413716259507 C 81018880 0 80523264 0 673 35520291 232346 18448 31529957
34483 44413751779798 C 80625664 0 80056320 0 672 34776822 194787 13009 32898429
34484 44413786556620 C 81412096 0 80928768 0 677 36494459 108806 18843 32076272
34485 44413823051079 C 80625664 0 80105472 0 670 35260730 59518 32022 31544043
34486 44413858311809 C 80887808 0 80453632 0 675 34858642 5691509 13785 27406610
34487 44413893170451 S 0 0 0 0 0 36041951 206593 22816 0
34488 44413929212402 O 0 0 0 0 0 0 0 0 0

after patch
WRITE: bw=2367MiB/s (2482MB/s), 64.3MiB/s-88.4MiB/s (67.4MB/s-92.7MB/s), io=69.5GiB (74.6GB), run=30001-30068msec
WRITE: bw=2397MiB/s (2514MB/s), 64.3MiB/s-93.2MiB/s (67.4MB/s-97.7MB/s), io=70.3GiB (75.5GB), run=30001-30029msec
WRITE: bw=2359MiB/s (2474MB/s), 62.1MiB/s-96.4MiB/s (65.1MB/s-101MB/s), io=69.2GiB (74.3GB), run=30001-30041msec
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
38602 45256675295546 C 81281024 0 80781312 0 682 37331366 327246 31073 32381340
38603 45256712626912 C 81281024 0 80658432 0 676 35745088 5685674 42948 26099076
38604 45256748372000 C 81281024 0 80605184 0 672 35411762 7509642 156310 29021598
38605 45256783783762 C 81412096 0 80834560 0 692 39907295 180550 37408 32893259
38606 45256823691057 C 81543168 0 81014784 0 680 36284749 6010436 106064 26588395
38607 45256859975806 C 81412096 0 80736256 0 678 36014189 6982617 27029 23381964
38608 45256895989995 C 81018880 0 80478208 0 678 33998979 244077 43094 30736135
38609 45256929988974 C 81281024 0 80789504 0 682 34759192 5729245 31807 25797541
38610 45256964748166 C 80887808 0 80359424 0 675 34697286 151429 29071 30306521
38611 45256999445452 C 81281024 0 80605184 0 669 33319205 5195582 23199 25388281
38612 45257032764657 C 80887808 0 80236544 0 670 33716458 108571 292652 30580742
38613 45257066481115 C 81281024 0 80629760 0 681 34141174 5460094 213898 25590267
38614 45257100622289 C 81149952 0 80486400 0 678 34771317 8126630 35692 24200981
38615 45257135393606 C 80756736 0 80203776 0 677 35418189 7796445 13185 25106909
38616 45257170811795 C 80887808 0 80248832 0 672 36260117 185389 21869 29588025
38617 45257207071912 C 81149952 0 80596992 0 682 32665857 2732853 66327 29102961
38618 45257239737769 C 81149952 0 80474112 0 669 34803758 3981020 33005 28281408
38619 45257274541527 C 81412096 0 80748544 0 677 35385203 118840 17162 32180451
38620 45257309926730 S 0 0 0 0 0 35474999 114903 26184 0
38621 45257345401729 O 0 0 0 0 0 0 0 0 0

Note that the wtime remains small after the patch. There is no txg with "W" status observed, which suggests the fill rate of the open txg is slower than the sync rate. The open stage is the bottleneck of the whole pipeline. The fio results before and after the patch are within the margin of error, so I can say the overhead of the extra mutex lock is negligible. I accidentally found that "abd scatter" has a negative impact on performance of about 20 percent. See below.

WRITE: bw=1966MiB/s (2062MB/s), 51.9MiB/s-87.4MiB/s (54.4MB/s-91.7MB/s), io=57.7GiB (61.9GB), run=30001-30044msec
WRITE: bw=1924MiB/s (2017MB/s), 51.1MiB/s-79.6MiB/s (53.5MB/s-83.5MB/s), io=56.5GiB (60.6GB), run=30001-30052msec
WRITE: bw=1907MiB/s (1999MB/s), 50.6MiB/s-80.2MiB/s (53.1MB/s-84.1MB/s), io=55.9GiB (60.1GB), run=30001-30044msec
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
45112 46616059350565 C 81281024 0 80605184 0 675 46312283 219458 14457 42556757
45113 46616105662848 C 81805312 0 81190912 0 684 47335452 359614 21052 41163965
45114 46616152998300 C 81149952 0 80678912 0 679 46887698 12228973 39583 29266544
45115 46616199885998 C 57688064 0 57012224 0 489 12512219 8332132 20771395 42377446
45116 46616212398217 C 81805312 0 81141760 0 683 71568296 3383276 24245 36827690
45117 46616283966513 C 81149952 0 80539648 0 678 44743970 10766099 32691 31341262
45118 46616328710483 C 81149952 0 80527360 0 674 46646751 220928 21371 40313402
45119 46616375357234 C 81149952 0 80527360 0 676 45575267 10715816 20249 29692062
45120 46616420932501 C 81281024 0 80748544 0 680 44971173 236248 25576 42036086
45121 46616465903674 C 81149952 0 80637952 0 680 46693765 214532 25288 43491260
45122 46616512597439 C 81281024 0 80760832 0 676 48310498 96684 6341685 35546532
45123 46616560907937 C 81149952 0 80588800 0 682 46402936 207012 22675 43780165
45124 46616607310873 C 81281024 0 80703488 0 683 48462547 13665952 22716 30395226
45125 46616655773420 C 57294848 0 56705024 0 494 13924883 224172 30015892 39373770
45126 46616669698303 C 101728256 0 101126144 0 835 73702605 4462872 27321 41168582
45127 46616743400908 C 80887808 0 80199680 0 669 55312100 7756103 49028 35153178
45128 46616798713008 C 81281024 0 80719872 0 676 47073669 408426 12489 40645494
45129 46616845786677 C 81149952 0 80637952 0 681 46721243 2543877 106690 38293195
45130 46616892507920 S 0 0 0 0 0 46271377 8687127 33222 0
45131 46616938779297 O 0 0 0 0 0 0 0 0 0

Now, I try to slow down the sync stage by adding gzip compression. I put buffer_compress_percentage=70 into the .fio file to make sure compression kicks in.

before patch compression=gzip
WRITE: bw=542MiB/s (568MB/s), 13.6MiB/s-23.6MiB/s (14.2MB/s-24.8MB/s), io=16.1GiB (17.2GB), run=30270-30358msec
WRITE: bw=496MiB/s (520MB/s), 11.0MiB/s-24.9MiB/s (12.5MB/s-26.1MB/s), io=14.6GiB (15.7GB), run=30176-30189msec
WRITE: bw=507MiB/s (532MB/s), 12.8MiB/s-26.9MiB/s (13.4MB/s-28.2MB/s), io=14.0GiB (16.1GB), run=30182-30191msec
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
59754 62524099457587 C 81018880 0 25616384 0 680 563753583 604707 26596 305528904
59755 62524663211170 C 140787712 0 44441600 0 1136 306248761 3026303 22437 497137187
59756 62524969459931 C 81018880 0 25714688 0 700 500277256 156580 1447246 314366264
59757 62525469737187 C 151011328 0 47648768 0 1220 322599153 3290237 30574 526694584
59758 62525792336340 C 80887808 0 25722880 0 679 533332264 4337 57648 299709676
59759 62526325668604 C 141574144 0 44552192 0 1141 299859969 128281 5261 495275433
59760 62526625528573 C 91111424 0 28794880 0 766 498822704 3994640 31961 326383169
59761 62527124351277 C 131088384 0 41373696 0 1075 330498419 146105 11557 464895527
59762 62527454849696 C 91111424 0 29016064 0 783 465150176 150680 9680 344827363
59763 62527919999872 C 130957312 0 41590784 0 1088 345078704 145752 8869 467954698
59764 62528265078576 C 91111424 0 28917760 0 773 468198146 109856 21032 329253132
59765 62528733276722 C 133709824 0 42364928 0 1086 330848546 527090 53951 468862927
59766 62529064125268 C 88489984 0 27975680 0 745 469540196 152291 4839014 321392357
59767 62529533665464 C 133709824 0 42094592 0 1072 326470809 100641 23589 481451266
59768 62529860136273 C 88621056 0 28028928 0 753 481675727 153903 8546 329382950
59769 62530341812000 C 133709824 0 42168320 0 1083 329632969 109922 6349364 471987117
59770 62530671444969 C 92422144 0 29364224 0 758 479269914 95419 22266 338497154
59771 62531150714883 C 137510912 0 43356160 0 1116 340680633 59653 23560 479593707
59772 62531491395516 S 0 0 0 0 0 479781560 106079 24871 0
59773 62531971177076 O 0 0 0 0 0 0 0 0 0

after patch compression=gzip
WRITE: bw=542MiB/s (568MB/s), 14.5MiB/s-22.0MiB/s (15.2MB/s-24.1MB/s), io=15.9GiB (17.1GB), run=30001-30026msec
WRITE: bw=491MiB/s (515MB/s), 12.9MiB/s-23.8MiB/s (13.5MB/s-24.9MB/s), io=14.4GiB (15.5GB), run=30076-30087msec
WRITE: bw=506MiB/s (531MB/s), 13.7MiB/s-22.8MiB/s (14.3MB/s-23.0MB/s), io=14.9GiB (15.0GB), run=30088-30103msec
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
60351 63292441072166 C 80625664 0 25665536 0 703 297278152 86551 286099131 293857052
60352 63292738350318 C 80625664 0 25432064 0 684 293177147 6684494 280244100 316954118
60353 63293031527465 C 80756736 0 25620480 0 692 298168675 2631230 303171202 289437707
60354 63293329696140 C 80756736 0 25620480 0 690 311718638 93003 283509921 292115530
60355 63293641414778 C 80756736 0 25546752 0 704 292827654 16694 282958337 291062908
60356 63293934242432 C 80756736 0 25473024 0 689 291490036 17600488 265024507 294677914
60357 63294225732468 C 81018880 0 25739264 0 695 290111552 147367 287134821 295117765
60358 63294515844020 C 80756736 0 25473024 0 690 293783917 40360 288662013 292821039
60359 63294809627937 C 80756736 0 25460736 0 679 294224677 37767 287350163 295239715
60360 63295103852614 C 80887808 0 25587712 0 706 292917223 279658 289516676 289092422
60361 63295396769837 C 81018880 0 25579520 0 693 296995015 883183 281102345 306474454
60362 63295693764852 C 80625664 0 25444352 0 668 287265507 65619 301226894 311818433
60363 63295981030359 C 80756736 0 25595904 0 696 319446207 9104382 284642452 303984653
60364 63296300476566 C 81281024 0 25833472 0 691 297675424 59914 300081102 291334517
60365 63296598151990 C 80887808 0 25661440 0 688 304402514 41802 287117406 290045658
60366 63296902554504 C 80756736 0 25620480 0 690 294605822 31144863 251541507 296751848
60367 63297197160326 C 80756736 0 25460736 0 684 287862885 39212 291608968 291407922
60368 63297485023211 S 0 0 0 0 0 302319539 717188 280107850 0
60369 63297787342750 W 0 0 0 0 0 285875039 98920 0 0
60370 63298073217789 O 0 0 0 0 0 0 0 0 0

A txg with "W" status is observed. However, there is no difference in the performance numbers. Then I try lz4.

before patch compression=lz4
WRITE: bw=2131MiB/s (2235MB/s), 58.0MiB/s-90.8MiB/s (60.9MB/s-95.2MB/s), io=62.5GiB (67.1GB), run=30001-30009msec
WRITE: bw=2137MiB/s (2241MB/s), 58.3MiB/s-82.7MiB/s (61.1MB/s-86.7MB/s), io=62.7GiB (67.3GB), run=30001-30033msec
WRITE: bw=2120MiB/s (2223MB/s), 57.3MiB/s-85.4MiB/s (60.1MB/s-89.6MB/s), io=62.2GiB (66.7GB), run=30001-30017msec
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
59497 62295821826121 C 156778496 0 54571008 0 735 53757120 1050416 23781 60254898
59498 62295875583241 C 80887808 0 28323840 0 410 64262346 133175 22957 52541648
59499 62295939845587 C 155992064 0 54206464 0 772 55745896 3385100 28723 63430302
59500 62295995591483 C 80756736 0 28360704 0 573 69832912 293554 40418 55024905
59501 62296065424395 C 154550272 0 53481472 0 718 58163148 4644850 27067 57527882
59502 62296123587543 C 80625664 0 28073984 0 452 64489090 47688 66250 51760139
59503 62296188076633 C 154419200 0 53702656 0 726 55520364 4505138 24659 53914306
59504 62296243596997 C 80887808 0 28303360 0 441 60844367 109522 12151 52066377
59505 62296304441364 C 155992064 0 54095872 0 765 55140233 66336 61905 60298365
59506 62296359581597 C 81412096 0 28467200 0 439 62989339 60392 18726 52740479
59507 62296422570936 C 149438464 0 51798016 0 933 52912442 2066965 668097 51618080
59508 62296475483378 C 81149952 0 28229632 0 511 55958937 115419 20038 44683512
59509 62296531442315 C 139870208 0 48553984 0 744 44923978 112406 7146 58676368
59510 62296576366293 C 103563264 0 35975168 0 601 63217211 12895231 30225 42121597
59511 62296639583504 C 142622720 0 49623040 0 663 60099689 5935466 29119 55191232
59512 62296699683193 C 90456064 0 31752192 0 612 63901173 4079559 32528 46513238
59513 62296763584366 C 136593408 0 47538176 0 846 51986673 109868 12668 57109625
59514 62296815571039 C 92815360 0 32604160 0 626 58741366 129464 25406 56746225
59515 62296874312405 S 0 0 0 0 0 61258339 5083757 24812 0
59516 62296935570744 O 0 0 0 0 0 0 0 0 0

after patch compression=lz4
WRITE: bw=1923MiB/s (2017MB/s), 52.6MiB/s-78.2MiB/s (55.2MB/s-82.0MB/s), io=56.4GiB (60.5GB), run=30001-30015msec
WRITE: bw=1955MiB/s (2050MB/s), 55.5MiB/s-74.6MiB/s (58.2MB/s-78.3MB/s), io=57.3GiB (61.5GB), run=30001-30019msec
WRITE: bw=1945MiB/s (2039MB/s), 53.1MiB/s-76.9MiB/s (55.6MB/s-80.7MB/s), io=57.0GiB (61.2GB), run=30001-30014msec
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
56569 58662912459068 C 80887808 0 28545024 0 491 43956263 223894 38125337 40469058
56570 58662956415331 C 81018880 0 28479488 0 578 42542788 149993 36205815 40886703
56571 58662998958119 C 81412096 0 28651520 0 458 40867326 6941001 29516076 41789555
56572 58663039825445 C 80887808 0 28307456 0 516 41185442 111783 37033506 43187297
56573 58663081010887 C 81149952 0 28532736 0 612 41650415 18136308 20630497 54277163
56574 58663122661302 C 81018880 0 28413952 0 516 43880870 4484189 44775201 46459031
56575 58663166542172 C 81149952 0 28364800 0 497 54876440 1460653 39458765 41243889
56576 58663221418612 C 80887808 0 28237824 0 510 44953578 198680 37118188 41593651
56577 58663266372190 C 80887808 0 28262400 0 555 41440694 3278676 34275200 47494209
56578 58663307812884 C 81018880 0 28319744 0 671 42433606 2265634 40429803 45357353
56579 58663350246490 C 81018880 0 28229632 0 536 47684134 1078895 39393881 40493091
56580 58663397930624 C 81412096 0 28565504 0 476 44713983 179891 36148825 43762076
56581 58663442644607 C 81018880 0 28340224 0 536 40719026 2519254 36937317 41403806
56582 58663483363633 C 81018880 0 28356608 0 447 43673183 2848838 34422591 41548927
56583 58663527036816 C 81281024 0 28291072 0 417 41327437 244554 37329331 44122044
56584 58663568364253 C 80756736 0 28323840 0 483 43497819 9150281 29134688 45397085
56585 58663611862072 C 80887808 0 28381184 0 589 42898166 2343277 38538419 44733896
56586 58663654760238 S 0 0 0 0 0 45052412 74066 40572391 0
56587 58663699812650 W 0 0 0 0 0 45309404 733220 0 0
56588 58663745122054 O 0 0 0 0 0 0 0 0 0

Conclusion: Regarding the test with gzip, I believe it just bottlenecks the whole pipeline, so the rest of the pipeline is irrelevant.
I don't think this is too surprising since there's a non-trivial fixed cost per-txg. What it might help us avoid though are latency spikes caused by bursty writes creating abnormally large transaction groups. By chance is this something you looked at as well?
The story here is complicated. It wouldn't surprise me if this is the case for a freshly booted system with a lot of free memory. However, when there isn't much free memory on the system, or it's heavily fragmented, is where you'll start to see problems caused by slower allocation times and even potentially OOM events. Stability and consistent performance were the main drivers for us switching to allocating scatter lists of pages. How would you like to move forward with this?
I am working on an improved version. Not only can it keep 3 active txgs in the pipeline, but it will also try to maximize the size of each txg given the same zfs_dirty_data_max. In principle, that one should take care of the negative impact.
In my experience, under normal use cases, latency issues can be mitigated by choosing a reasonable zfs_dirty_data_max.
Here is the latest test based on my 2nd commit, compared to upstream. I changed my test setup a little bit to get more consistent results. The VM has 4G total memory. The zpool is created with:
zpool create -o ashift=12 -O primarycache=metadata test /dev/vdb
zfs set compression=lz4 test

.fio file
After patch
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
35679 69092572632647 C 80756736 0 57958400 0 646 40669578 43437 20614115 44124355
35680 69092613302225 C 80756736 0 57802752 0 522 47436725 68818 17283517 43646520
35681 69092660738950 C 56377344 0 40411136 0 328 58109274 21868 2872966 37731333
35682 69092718848224 C 80756736 0 58036224 0 615 35292948 44108 5294440 47980497
35683 69092754141172 C 75120640 0 53981184 0 613 50460510 31963 2834690 50602716
35684 69092804601682 C 87310336 0 62488576 0 677 44603987 4861175 4013901 44782167
35685 69092849205669 C 80756736 0 57933824 0 664 42186732 53151 11423956 53068521
35686 69092891392401 C 80756736 0 57847808 0 566 35091360 36302 29431026 46645059
35687 69092926483761 C 80756736 0 58060800 0 684 47961384 50720 28109134 47815868
35688 69092974445145 C 80756736 0 57933824 0 633 51824890 32915 24130930 50572332
35689 69093026270035 C 81018880 0 58187776 0 619 52857654 3141 21892758 51751870
35690 69093079127689 C 80756736 0 57823232 0 640 54874669 39900 18739796 57640744
35691 69093134002358 C 80756736 0 57958400 0 653 44241695 36307 32153019 52333859
35692 69093178244053 C 80756736 0 57974784 0 475 54879156 52565 29602071 36515528
35693 69093233123209 C 57819136 0 41521152 0 454 63174331 21161 2981065 40224102
35694 69093296297540 C 80756736 0 57790464 0 555 33685557 101633 9444723 46084255
35695 69093329983097 C 57294848 0 40947712 0 489 54311096 2533 1322765 39443299
35696 69093384294193 C 80756736 0 57950208 0 621 29815052 93420 10865597 44708670
35697 69093414109245 C 80756736 0 58032128 0 588 33851443 23461 21803430 44525663
35698 69093447960688 S 0 0 0 0 0 62570654 24644 3762074 0
35699 69093510531342 W 0 0 0 0 0 35269108 25155 0 0
35700 69093545800450 O 0 0 0 0 0 0 0 0 0

A txg with "W" state is observed. I did 4 batch runs in total.

after patch
WRITE: bw=1519MiB/s (1593MB/s), 1519MiB/s-1519MiB/s (1593MB/s-1593MB/s), io=44.5GiB (47.8GB), run=30011-30011
WRITE: bw=1540MiB/s (1615MB/s), 1540MiB/s-1540MiB/s (1615MB/s-1615MB/s), io=45.1GiB (48.5GB), run=30007-30007
WRITE: bw=1523MiB/s (1597MB/s), 1523MiB/s-1523MiB/s (1597MB/s-1597MB/s), io=44.6GiB (47.9GB), run=30001-30001

before patch
WRITE: bw=1580MiB/s (1657MB/s), 1580MiB/s-1580MiB/s (1657MB/s-1657MB/s), io=46.3GiB (49.7GB), run=30001-30001
WRITE: bw=1549MiB/s (1624MB/s), 1549MiB/s-1549MiB/s (1624MB/s-1624MB/s), io=45.4GiB (48.7GB), run=30001-30001
WRITE: bw=1553MiB/s (1629MB/s), 1553MiB/s-1553MiB/s (1629MB/s-1629MB/s), io=45.5GiB (48.9GB), run=30001-30001

after patch
WRITE: bw=1532MiB/s (1606MB/s), 1532MiB/s-1532MiB/s (1606MB/s-1606MB/s), io=44.9GiB (48.2GB), run=30001-30001
WRITE: bw=1557MiB/s (1632MB/s), 1557MiB/s-1557MiB/s (1632MB/s-1632MB/s), io=45.6GiB (48.0GB), run=30002-30002
WRITE: bw=1564MiB/s (1640MB/s), 1564MiB/s-1564MiB/s (1640MB/s-1640MB/s), io=45.8GiB (49.2GB), run=30001-30001

before patch
WRITE: bw=1522MiB/s (1596MB/s), 1522MiB/s-1522MiB/s (1596MB/s-1596MB/s), io=44.6GiB (47.9GB), run=30006-30006
WRITE: bw=1518MiB/s (1592MB/s), 1518MiB/s-1518MiB/s (1592MB/s-1592MB/s), io=44.5GiB (47.8GB), run=30001-30001
WRITE: bw=1533MiB/s (1607MB/s), 1533MiB/s-1533MiB/s (1607MB/s-1607MB/s), io=44.9GiB (48.2GB), run=30008-30008

From the numbers above, I can say they are about equal. I also did some tests after putting "iotune" (an I/O throttle) on the block device passed through KVM. All the results converge to the same value within 1 MB/s deviation, which just proves the throttle is doing its job. I don't think those results help in my case, so I did not include them here. Since all the txgs then have the same size, it does help smooth out write operations a lot.
The CentOS Stream 8 x86_64 (TEST) build slave failed at:
[ 9994.053914] VERIFY3(spa->spa_checkpoint_info.sci_dspace == ckpoint_sm_space_sum) failed (186398720 == 378475008)
[ 9994.055615] PANIC at spa_checkpoint.c:276:spa_checkpoint_accounting_verify()

I am not sure if it is related to this pull request.
I like using dp_dirty_pertxg[] as the min sync threshold instead of dp_dirty_total. The dp_dirty_total use in txg_sync_thread() is OK only because there is no syncing or quiesced TXG at that point, and so dp_dirty_total is effectively equal to dp_dirty_pertxg[]. In dsl_pool_need_dirty_delay() that is not true, and it is indeed worth fixing.

I also like the idea of preemptive quiescing. But looking at almost 30% of wtime (if I read it correctly), I think the specific math is broken. To reach maximum performance, wtime should be minimal. On one side, dirty data belonging to the waiting TXG are not being synced yet and so are not amortizing metadata updates; on the other, they also can't be aggregated with new writes coming into the next open TXG. It is a waste from all sides.

I think your problem is in the 48% constant. Since IIRC dp_dirty_total is reduced with each block write completion, you should not start quiescing when the new transaction reaches 48% of peak, because that does not mean syncing of the previous TXG is almost committed; it actually means it still has an amount of work to do equal to the amount in the open TXG, and if you quiesce now, the TXG will have to wait.

Ideal would be some self-adapting mechanism, quiescing the new transaction when the previous sync is getting close to completion, so that quiescing completes just in time for it. The easiest could be to bump your 49% to somewhere around 90-95%, and not from the peak, but just from dp_dirty_total. But better, I think, would be to make the threshold increase each time a TXG has to wait and decrease each time the sync thread has to wait for work.
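A minimal, self-contained C sketch of that last self-adapting idea, purely illustrative: the quiesce_pct variable, step sizes and bounds are assumptions of this sketch and do not correspond to any existing ZFS tunable or code.

#include <stdint.h>

/* Percentage of dp_dirty_total at which the open txg gets quiesced. */
static uint64_t quiesce_pct = 90;

/*
 * Called once per completed txg: if the quiesced txg had to wait for the
 * sync thread (wtime observed), quiesce later next time; if the sync
 * thread ran out of work, quiesce earlier next time.
 */
void
adapt_quiesce_threshold(int txg_had_to_wait, int sync_thread_starved)
{
        if (txg_had_to_wait && quiesce_pct < 100)
                quiesce_pct++;          /* quiesced too early, back off */
        else if (sync_thread_starved && quiesce_pct > 50)
                quiesce_pct--;          /* sync stage idled, quiesce sooner */
}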
I just did some tests. It is true that dp_dirty_total is reduced as each block write completes.
Thank you for your wonderful analysis.
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
175228 639992657529006 C 88883200 0 63811584 0 147 730511301 1944 48814 636039963
175229 639993388040307 C 97533952 0 70033408 0 150 73487156 1900 562623368 696947897
175230 639993461527463 C 93077504 0 66801664 0 145 1259601552 1588 4407969 670666142
175231 639994721129015 C 97533952 0 69980160 0 127 675090916 1895 13802 699094297
175232 639995396219931 C 93077504 0 66818048 0 141 699124773 1796 12220 668390812
175233 639996095344704 C 97533952 0 69992448 0 144 668450386 3069 55495 701103186
175234 639996763795090 C 93208576 0 66883584 0 139 701177058 62445 13914 667921150
175235 639997464972148 C 97009664 0 69705728 0 142 69438844 1936 598580995 697326573
175236 639997534410992 C 100679680 0 72155136 0 148 1299152263 8157 32047 712104026
175237 639998833563255 C 98320384 0 70606848 0 149 716013315 2158 52718 706274753
175238 639999549576570 C 92028928 0 66048000 0 146 706363871 1756 16623 662713913
175239 640000255940441 C 97665024 0 70115328 0 143 71044598 134969 591569599 693088396
175240 640000326985039 C 92946432 0 66707456 0 153 1284808543 5896 8703 672417051
175241 640001611793582 C 97665024 0 70086656 0 138 672454387 1897 8659 701585716
175242 640002284247969 C 92946432 0 66723840 0 147 701615445 2645 15924 666385440
175243 640002985863414 C 100679680 0 72290304 0 142 667681067 3388 30788 725934199
175244 640003653544481 C 89931776 0 64565248 0 141 725982977 2152 31483 640424032
175245 640004379527458 C 97140736 0 69722112 0 150 68105377 1794 572370525 698288949
175246 640004447632835 C 93470720 0 67076096 0 143 1270683632 2705 20113 671316514
175247 640005718316467 C 97271808 0 69828608 0 150 671375409 58159 30173 696761294
175248 640006389691876 C 93470720 0 67084288 0 145 696871322 85294 32266 670010312
175249 640007086563198 C 97140736 0 69713920 0 147 670145485 1947 15488 701376097
175250 640007756708683 C 93601792 0 67264512 0 140 701430659 47362 17639 669679903
175251 640008458139342 C 97140736 0 69611520 0 145 669828662 22301 17254 698234887
175252 640009127968004 C 93601792 0 67219456 0 146 698297046 64492 19134 670141928
175253 640009826265050 C 97140736 0 69722112 0 147 670257019 85959 27259 698071692
175254 640010496522069 C 93470720 0 67096576 0 136 698199954 1586 14777 669633049
175255 640011194722023 C 99631104 0 71524352 0 142 670862388 1923 28999 715478313
175256 640011865584411 S 0 0 0 0 0 716002427 2868 46705 0
175257 640012581586838 W 0 0 0 0 0 70924057 2243 0 0
175258 640012652510895 O 0 0 0 0 0 0 0 0 0
root@dev:~#

From the test, 55% cannot reliably cause enough wtime. Even before this patch, the size of the open txg is coupled with the size of the previous syncing txg. (You can see the "Before Patch" results in my previous test.) The test results do not make sense to me. I am going to investigate this issue. 90%-95% is a good target to work on.
To make the results more meaningful, could you please test this in a more traditional and controllable environment: real hardware instead of a VM, a sufficient amount of memory (at least 8GB), a sufficient number of CPU cores (I'd prefer my patch to reduce starvation and scheduler effects: #11966), no unusual custom tunings, real HDDs and/or SSDs instead of RAM disks with unrealistic characteristics, etc.
Your peak value should grow immediately, while decaying very slowly over dozens and hundreds of TXGs. In such a scenario 95% should almost never be reached, while 48 or 55% may just appear in the middle of the range and be somewhat effective, but, judging by the large wtime, suboptimal. That is why I proposed to measure the percentage from dp_dirty_total. It would also make the function continuous with "if (dirty == total)" if the percentage were increased to 100%.
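One way to read that suggestion, as a hedged C sketch: the helper name and the simplified arguments are assumptions of this sketch; in ZFS, dp_dirty_total would be the sum of the open txg's and the still-syncing txg's dirty bytes.

#include <stdbool.h>
#include <stdint.h>

/*
 * Quiesce the open txg once its own dirty bytes make up at least `pct`
 * percent of dp_dirty_total, i.e. once the still-syncing txg has little
 * dirty data left.  At pct == 100 this reduces to open_dirty == total,
 * which keeps the condition continuous as suggested above.
 */
bool
ready_to_quiesce(uint64_t open_dirty, uint64_t dp_dirty_total, uint64_t pct)
{
        return (open_dirty * 100 >= dp_dirty_total * pct);
}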
I will switch to dp_dirty_total instead of the "peak" just as you proposed. It should work.
I have updated the code and force-pushed the 2nd commit. I use 95% of dp_dirty_total, and it works.

txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
1565 5496303431850 C 0 0 0 0 0 5119959415 3724 147363 177040
1566 5501423391265 C 0 0 0 0 0 5119963671 2472 60575 135343
1567 5506543354936 C 53886976 94208 39202816 3 316 2860208257 2198 65569 297276216
1568 5509403563193 C 180240384 45056 129978368 1 313 224567505 68088 72719205 1299854320
1569 5509628130698 C 208814080 90112 150659072 2 345 1290474137 114683 82060832 1507674055
1570 5510918604835 C 208683008 45056 150470656 1 360 1502925069 114708 86833356 1505566546
1571 5512421529904 C 207765504 49152 149929984 2 367 1500377420 108795 92036122 1499341693
1572 5513921907324 C 209076224 0 150888448 0 382 1507714805 55855 83723656 1509124667
1573 5515429622129 C 208551936 0 150274048 0 348 1504038022 120483 88754240 1502270760
1574 5516933660151 C 208945152 0 150593536 0 336 1509271676 148127 81732913 1506392139
1575 5518442931827 C 208420864 0 150220800 0 361 1503087396 112930 85079606 1501720665
1576 5519946019223 C 208551936 0 150286336 0 347 1500980139 133760 85807463 1502374394
1577 5521446999362 C 207896576 0 149979136 0 365 1500125236 43547 88156854 1500720894
1578 5522947124598 C 208420864 0 150278144 0 349 1502709352 97334 86129509 1503328939
1579 5524449833950 C 208945152 0 150941696 0 405 1502723116 111111 86740203 1508737206
1580 5525952557066 C 207896576 0 150036480 0 385 1505874501 124159 89597580 1500005841
1581 5527458431567 C 208027648 0 150036480 0 345 1498451865 98345 91194253 1499577541
1582 5528956883432 C 208683008 8192 150700032 2 403 1507652237 187775 83036844 1508258005
1583 5530464535669 C 208420864 0 150503424 0 381 1503345262 80259 88073051 1505072490
1584 5531967880931 C 207503360 4096 149741568 1 372 1500752600 130817 92352989 1497138929
1585 5533468633531 C 208551936 0 150364160 0 337 1504274294 94290 85265121 1503351203
1586 5534972907825 C 208683008 0 150638592 0 352 1502115981 113372 86490165 1506278306
1587 5536475023806 C 208420864 0 150523904 0 361 1505932533 221217 86740773 1505880910
1588 5537980956339 C 208158720 0 150204416 0 359 1502386943 87400 90378672 1501457737
1589 5539483343282 C 208420864 0 150634496 0 374 1504677135 46652 87208316 1506916697
1590 5540988020417 C 209076224 0 150999040 0 369 1507227127 91725 86860868 1509600687
1591 5542495247544 C 208289792 0 150343680 0 371 1506282721 86007 90197136 1503223800
1592 5544001530265 S 0 0 0 0 0 1509079993 116448 84317422 0
1593 5545510610258 O 0 0 0 0 0 0 0 0 0
root@dev:~#
5% of wtime is definitely better than 30%, though it is still a few orders of magnitude bigger than qtime. I am thinking about whether we could get closer. Also, I am curious how it would coexist with scrub, which runs in the sync thread but does not touch the dirty data counters (will it inflate wtime with this patch?).
Have you actually seen qtime reaching significant times, where the promised "improved write throughput" would become measurable? I am not opposing, just trying to better understand the scenarios and how big the time reserve should be.
I rarely see qtime reaching 1ms. Assuming the syncing time is 1s, at best you will get a 0.1% improvement, in theory.
When recently working on prefetch, I noticed that the write path does additional reads before opening the transaction (see dmu_tx_check_ioerr()). I suspect it could not always be that way, in which case I suppose txg quiescing could take more time while all cache misses are resolved with synchronous reads. I wonder if there are still some cases not covered by such prefetch logic where the transaction may need more time between assign and commit. And if there are, can we predict that, to start quiescing reasonably in advance?
I just force-pushed the 2nd commit. Added a new module parameter zfs_txg_quiesce_advance to control the timing of quiescing the open txg. Below is a quick test with the default value zfs_txg_quiesce_advance = 0, run with fio random write.

txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
21726 103206225664986 C 0 245760 1519616 29 102 633218 795 10791 17687377
21727 103206226298204 C 0 0 0 0 0 24079202 1612 10784 77452
21728 103206250377406 C 0 0 1605632 0 105 6145616 1579 32674 8932632
21729 103206256523022 C 0 0 0 0 0 18879744 2748 40313 90109
21730 103206275402766 C 0 0 778240 0 58 5254805809 2917 122647 11993657
21731 103211530208575 C 57294848 20480 41906176 4 357 1756587977 4992557 19357 321394414
21732 103213286796552 C 221003776 0 159694848 0 381 315266593 2576 11146203 1596576140
21733 103213602063145 C 225984512 0 163201024 0 393 1596079432 2517 11651015 1633176640
21734 103215198142577 C 225984512 0 163164160 0 341 1632792692 2102 12046204 1632454902
21735 103216830935269 C 228605952 0 165183488 0 436 1644532751 74407 37124 1651635616
21736 103218475468020 C 226508800 0 163627008 0 454 1651943777 21304 42311 1635996946
21737 103220127411797 C 225460224 24576 162725888 2 402 1623015694 3171 13051151 1627796546
21738 103221750427491 C 225067008 0 162455552 0 417 1628279032 2409 12580216 1624712731
21739 103223378706523 C 226770944 0 163827712 0 382 1624882610 3176 12426656 1638119133
21740 103225003589133 C 228737024 0 165191680 0 357 1651203408 1301 33939 1651041132
21741 103226654792541 C 226377728 0 163581952 0 396 1651254195 1186885 38175 1635159934
21742 103228306046736 C 226115584 0 163246080 0 420 1619725232 135801 16539105 1631976661
21743 103229925771968 C 224673792 0 162148352 0 396 1635030035 3047 13630285 1621877783
21744 103231560802003 C 225460224 0 162836480 0 390 1621542461 4388 13975316 1627243231
21745 103233182344464 S 0 0 0 0 0 1629081387 2957 12151616 0
21746 103234811425851 O 0 0 0 0 0 0 0 0 0
From my latest test, the first txg (21726) is the txg right after the pool is imported.
5ms is obviously a spike, but I am not sure what the cause of it is. As I mentioned, zfs_write() via dmu_tx_hold_write_by_dnode() calls dmu_tx_check_ioerr() before opening the transaction, which moves the metadata reads earlier, out of qtime. I suspect that if you add "return (0);" into dmu_tx_check_ioerr() you should see a huge qtime increase. What is there now, though, I am not sure. It can be some memory allocation or some metadata access not covered by the above code, or the thread that opened the transaction could be context-switched before committing it. Or, theoretically, there could be some swap-in of the data written, if they are not resident somehow. But all of those are pretty small and unpredictable to take into account.
This patch may introduce some race conditions. |
Added a new module parameter zfs_txg_quiesce_advance to control the timing of quiescing the open txg. The open txg can be quiesced into the pipeline even if there is a txg still syncing. When the dirty data in the syncing txg falls below zfs_txg_quiesce_advance, which also means the sync is about to complete, quiesce the open txg into the pipeline. A value of 0 means the open txg is only quiesced when all the data in the previous txg has synced. When all dirty data is written out, the sync thread may not return immediately; it may be blocked by some mutex lock. Thus, even with a value of 0, it still helps minimize idle time of the sync stage. Recommended value: 0 - 5% of zfs_dirty_data_max in bytes. I also moved the txg_kick() call into dsl_pool_dirty_space() and dsl_pool_undirty_space(), so that txg_kick() can also be triggered by the undirty routines. Addressed the race condition on tx->tx_open_txg in txg_sync_thread(). Some other style fixes from code review. Signed-off-by: jxdking <[email protected]>
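A hedged, self-contained C sketch of the decision described in that commit message; the helper and its simplified argument are illustrative assumptions, and only the zfs_txg_quiesce_advance name mirrors the proposed tunable.

#include <stdbool.h>
#include <stdint.h>

/* Proposed tunable: 0 to ~5% of zfs_dirty_data_max, in bytes. */
static uint64_t zfs_txg_quiesce_advance = 0;

/*
 * Quiesce the open txg as soon as the still-syncing txg has at most
 * zfs_txg_quiesce_advance bytes of dirty data left.  With the default
 * of 0 this only fires once the previous txg's dirty data is fully
 * written out, which can still be slightly before the sync thread
 * actually returns (it may block on a mutex after the last write).
 */
bool
can_quiesce_early(uint64_t syncing_dirty_bytes)
{
        return (syncing_dirty_bytes <= zfs_txg_quiesce_advance);
}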
I have patched up the race condition of tx_open_txg in txg_sync_thread(). I just force pushed commit 6b40c56 |
So far, I am not able to verify the improved-throughput claim in my test environment.
In conclusion. |
Description
Current txg_kick() only works if there is no txg in either the quiescing
or syncing stage. This commit allows txg_kick() to work as long as
no txg is quiescing.
This keeps the syncing stage busy, which should benefit write throughput.
Also, txg_kick() should be triggered on dp_dirty_pertxg[] instead of
dp_dirty_total, because dp_dirty_total won't be decreased until the dirty
txg is synced. This change prevents pushing empty txgs through the pipeline
back to back.
Motivation and Context
It should improve write throughput, especially with SSD and NVMe pools.
However, the improvement on HDD-based pools is not measurable, because the quiescing stage is very fast compared to HDD sync times.
How Has This Been Tested?
Tested with fio, doing sequential writes. The test VM runs with 2G of memory. zfs_arc_min is bumped to 1G. Other parameters remain at their defaults.
Before patch:
After patch: