
Allow txg_kick() even if there is a txg still syncing #11929

Closed · wants to merge 2 commits

Conversation

jxdking
Contributor

@jxdking commented Apr 22, 2021

Description

Currently, txg_kick() only works if there is no txg in either the quiescing
or syncing state. This commit allows txg_kick() to work as long as no txg
is quiescing.

This keeps the syncing stage busy, which should benefit write throughput.

Also, txg_kick() should be triggered on dp_dirty_pertxg[] instead of
dp_dirty_total, because dp_dirty_total is not decreased until the dirty
txg is synced. This change prevents pushing empty txgs through the
pipeline back to back.
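
A minimal sketch of both changes, assuming the existing tx_state_t fields and helpers in module/zfs/txg.c (an illustration of the idea, not the exact diff; dsl_pool_dirty_kick_needed() is a hypothetical name):

/*
 * Sketch only. Relax txg_kick() so that a syncing txg no longer blocks
 * the kick; only an in-flight quiesce does.
 */
void
txg_kick(dsl_pool_t *dp)
{
	tx_state_t *tx = &dp->dp_tx;

	mutex_enter(&tx->tx_sync_lock);
	/* Before the patch this also required !txg_is_syncing(dp). */
	if (!txg_is_quiescing(dp) &&
	    tx->tx_quiesce_txg_waiting <= tx->tx_open_txg) {
		tx->tx_quiesce_txg_waiting = tx->tx_open_txg + 1;
		cv_broadcast(&tx->tx_quiesce_more_cv);
	}
	mutex_exit(&tx->tx_sync_lock);
}

/*
 * Hypothetical trigger: key off the open txg's own dirty bytes rather
 * than dp_dirty_total, which stays elevated until the syncing txg
 * completes and would otherwise keep kicking empty txgs down the
 * pipeline.
 */
static boolean_t
dsl_pool_dirty_kick_needed(dsl_pool_t *dp, uint64_t txg)
{
	uint64_t dirty = dp->dp_dirty_pertxg[txg & TXG_MASK];

	return (dirty > zfs_dirty_data_max *
	    zfs_dirty_data_sync_percent / 100);
}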

Motivation and Context

It should improve write throughput, especially on SSD and NVMe pools.
The improvement on HDD-based pools is not measurable, however, because the quiescing stage is so fast compared to HDD sync times.

How Has This Been Tested?

Tested with fio doing sequential writes. The test VM runs with 2 GB of memory; zfs_arc_min is bumped to 1 GB. Other parameters remain at their defaults.
Before patch:

root@dev:~# tail /proc/spl/kstat/zfs/test/txgs
71498    65043363629648   C     43925504     0            43208192     0        409      864390244    2828         29510        498654107
71499    65044228019892   C     101597184    0            100833280    0        842      498719485    2114         13211        619938757
71500    65044726739377   C     43925504     0            43409408     0        414      620010094    1782         19106        515227503
71501    65045346749471   C     101597184    0            100971520    0        840      515275769    1734         13760        698601092
71502    65045862025240   C     43925504     0            43195904     0        411      698653111    1877         4463859      396687228
71503    65046560678351   C     101597184    0            100919296    0        836      401169143    1166         6028         1805292783
71504    65046961847494   C     43925504     0            43279872     0        414      1805322266   1161         10955        383355578
71505    65048767169760   C     101728256    0            101112320    0        839      383452517    95278        408263       613475420
71506    65049150622277   S     0            0            0            0        0        613997275    1704         9604         0
71507    65049764619552   O     0            0            0            0        0        0            0            0            0

After patch:

root@dev:~# tail /proc/spl/kstat/zfs/test/txgs
70648    60985322448558   C     55721984     0            55094784     0        498      791474596    2859         267504511    925455339
70649    60986113923154   C     44318720     0            43598336     0        413      267580813    307888       925133065    201306178
70650    60986381503967   C     47726592     0            47099904     0        440      925436392    1316         201323924    708992030
70651    60987306940359   C     55590912     0            54942208     0        492      201664659    6605516      702092920    232786526
70652    60987508605018   C     44318720     0            43713536     0        416      708832391    279971       232420066    262959123
70653    60988217437409   C     47857664     0            47221760     0        444      232834897    4869         262833639    1175621968
70654    60988450272306   C     55328768     0            54640128     0        490      272623304    3986242      1161879733   435899211
70655    60988722895610   S     0            0            0            0        0        1165884947   109646       435806457    0
70656    60989888780557   W     0            0            0            0        0        436512899    1850207      0            0
70657    60990325293456   O     0            0            0            0        0        0            0            0            0

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@behlendorf added the "Status: Code Review Needed" label Apr 22, 2021
@jxdking changed the title from "Allow kick_txg() even there is a txg still syncing" to "Allow txg_kick() even there is a txg still syncing" Apr 22, 2021
@jxdking marked this pull request as ready for review April 22, 2021 23:40
@adamdmoss
Contributor

FYI I've been running with this for a couple of days - I haven't benchmarked it but it's been behaving fine so far.

Contributor

@behlendorf left a comment


In principle this makes good sense to me to help smooth out the txgs.

Review comments were left on module/zfs/dmu_tx.c, module/zfs/dsl_pool.c, and module/zfs/txg.c (since resolved).
@jxdking
Contributor Author

jxdking commented Apr 27, 2021

In principle this makes good sense to me to help smooth out the txgs.

The test below is based on my first commit; the conclusion is at the end of this post.
My 2nd commit has fixed the throughput issue mentioned below.

Setup:
KVM guest, 12 GB memory, 8 threads (Haswell)
Ubuntu Groovy
8 GB zpool on a file; the file is on an 8 GB ramdisk.
Created with

zpool create -o ashift=12 -O primarycache=metadata test /tmp/mytempfs/d1

Parameters

echo $(expr 1024 \* 1024 \* 256) >> /sys/module/zfs/parameters/zfs_dirty_data_max
echo $(expr 1024 \* 1024 \* 1024) >> /sys/module/zfs/parameters/zfs_arc_min
echo 0 >> /sys/module/zfs/parameters/zfs_abd_scatter_enabled
echo 30 >> /sys/module/zfs/parameters/zfs_dirty_data_sync_percent

Note: zfs_dirty_data_sync_percent is set to 30%. I found there is a fixed overhead to sync a txg regardless of the txg size. Basically, you want to keep the txg size as large as possible, so that fewer txgs are needed to get the job done. However, dirty data is soft-capped by zfs_delay_min_dirty_percent (60%), and you also want at least 2 txgs in the pipeline at the same time, one syncing and one open. Thus, 30% might be the optimal value.
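
To make that arithmetic concrete, here is a toy calculation using the zfs_dirty_data_max value set above (standalone C, just for illustration):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* Mirrors the tunables set above: zfs_dirty_data_max = 256 MiB. */
	uint64_t dirty_data_max = 256ULL << 20;
	uint64_t sync_thresh = dirty_data_max * 30 / 100;	/* per-txg sync trigger */
	uint64_t delay_thresh = dirty_data_max * 60 / 100;	/* zfs_delay_min_dirty_percent */

	/*
	 * Two txgs at the sync trigger (one syncing, one open) total
	 * exactly the 60% soft cap: 2 * 76.8 MiB = 153.6 MiB.
	 */
	printf("sync at %.1f MiB, delay at %.1f MiB\n",
	    sync_thresh / 1048576.0, delay_thresh / 1048576.0);
	return (0);
}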

.fio file:

; fio-seq-write.job for fiotest

[global]
name=fio-seq-write
filename=/test/fio-seq-write
rw=write
bs=128K
direct=1
numjobs=32
time_based=1
runtime=30
buffer_compress_percentage=70

[file1]
size=4096M
ioengine=libaio
iodepth=16

before patch
(I have removed unnecessary lines)

  WRITE: bw=2312MiB/s (2425MB/s), 62.5MiB/s-93.4MiB/s (65.5MB/s-97.9MB/s), io=67.8GiB (72.8GB), run=30001-30036msec
  WRITE: bw=2364MiB/s (2479MB/s), 63.3MiB/s-100MiB/s (66.3MB/s-105MB/s), io=69.3GiB (74.5GB), run=30001-30037msec
  WRITE: bw=2366MiB/s (2481MB/s), 62.9MiB/s-96.8MiB/s (65.9MB/s-102MB/s), io=69.4GiB (74.5GB), run=30001-30047msec
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
34469    44413228261394   C     81412096     0            80900096     0        680      38990172     90459        39504        31888244
34470    44413267251566   C     80756736     0            80183296     0        676      35474837     95056        15563        33482372
34471    44413302726403   C     81543168     0            81043456     0        678      36462139     6058754      20627        26053420
34472    44413339188542   C     81412096     0            80924672     0        679      35180551     5144063      16541        31480321
34473    44413374369093   C     80756736     0            80134144     0        662      41049444     297837       14129        32115820
34474    44413415418537   C     80887808     0            80412672     0        679      35711703     117254       17481        33287954
34475    44413451130240   C     80756736     45056        80269312     1        679      37546015     221847       20722        37446051
34476    44413488676255   C     81412096     0            80912384     0        677      41115482     4964932      44709        28192250
34477    44413529791737   C     81674240     0            81088512     0        681      36332028     114537       23074        31926800
34478    44413566123765   C     81412096     0            80842752     0        680      35750331     145100       39345        34063030
34479    44413601874096   C     81412096     0            80936960     0        686      37437573     124580       13599        34473612
34480    44413639311669   C     81018880     0            80384000     0        669      38736130     122841       34216        34415495
34481    44413678047799   C     81018880     0            80433152     0        673      38211708     219022       10173        32172248
34482    44413716259507   C     81018880     0            80523264     0        673      35520291     232346       18448        31529957
34483    44413751779798   C     80625664     0            80056320     0        672      34776822     194787       13009        32898429
34484    44413786556620   C     81412096     0            80928768     0        677      36494459     108806       18843        32076272
34485    44413823051079   C     80625664     0            80105472     0        670      35260730     59518        32022        31544043
34486    44413858311809   C     80887808     0            80453632     0        675      34858642     5691509      13785        27406610
34487    44413893170451   S     0            0            0            0        0        36041951     206593       22816        0
34488    44413929212402   O     0            0            0            0        0        0            0            0            0

after patch

  WRITE: bw=2367MiB/s (2482MB/s), 64.3MiB/s-88.4MiB/s (67.4MB/s-92.7MB/s), io=69.5GiB (74.6GB), run=30001-30068msec
  WRITE: bw=2397MiB/s (2514MB/s), 64.3MiB/s-93.2MiB/s (67.4MB/s-97.7MB/s), io=70.3GiB (75.5GB), run=30001-30029msec
  WRITE: bw=2359MiB/s (2474MB/s), 62.1MiB/s-96.4MiB/s (65.1MB/s-101MB/s), io=69.2GiB (74.3GB), run=30001-30041msec
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
38602    45256675295546   C     81281024     0            80781312     0        682      37331366     327246       31073        32381340
38603    45256712626912   C     81281024     0            80658432     0        676      35745088     5685674      42948        26099076
38604    45256748372000   C     81281024     0            80605184     0        672      35411762     7509642      156310       29021598
38605    45256783783762   C     81412096     0            80834560     0        692      39907295     180550       37408        32893259
38606    45256823691057   C     81543168     0            81014784     0        680      36284749     6010436      106064       26588395
38607    45256859975806   C     81412096     0            80736256     0        678      36014189     6982617      27029        23381964
38608    45256895989995   C     81018880     0            80478208     0        678      33998979     244077       43094        30736135
38609    45256929988974   C     81281024     0            80789504     0        682      34759192     5729245      31807        25797541
38610    45256964748166   C     80887808     0            80359424     0        675      34697286     151429       29071        30306521
38611    45256999445452   C     81281024     0            80605184     0        669      33319205     5195582      23199        25388281
38612    45257032764657   C     80887808     0            80236544     0        670      33716458     108571       292652       30580742
38613    45257066481115   C     81281024     0            80629760     0        681      34141174     5460094      213898       25590267
38614    45257100622289   C     81149952     0            80486400     0        678      34771317     8126630      35692        24200981
38615    45257135393606   C     80756736     0            80203776     0        677      35418189     7796445      13185        25106909
38616    45257170811795   C     80887808     0            80248832     0        672      36260117     185389       21869        29588025
38617    45257207071912   C     81149952     0            80596992     0        682      32665857     2732853      66327        29102961
38618    45257239737769   C     81149952     0            80474112     0        669      34803758     3981020      33005        28281408
38619    45257274541527   C     81412096     0            80748544     0        677      35385203     118840       17162        32180451
38620    45257309926730   S     0            0            0            0        0        35474999     114903       26184        0
38621    45257345401729   O     0            0            0            0        0        0            0            0            0

Note that the wtime remains small after the patch. No txg with "W" status is observed, which suggests the fill rate of the open txg is slower than the sync rate; the open stage is the bottleneck of the whole pipeline. The fio results before and after the patch are within the margin of error, so the overhead of the extra mutex lock can be ignored.

I accidentally found that "abd scatter" has a roughly 20 percent negative impact on performance. See below.
after patch (abd scatter enabled)

  WRITE: bw=1966MiB/s (2062MB/s), 51.9MiB/s-87.4MiB/s (54.4MB/s-91.7MB/s), io=57.7GiB (61.9GB), run=30001-30044msec
  WRITE: bw=1924MiB/s (2017MB/s), 51.1MiB/s-79.6MiB/s (53.5MB/s-83.5MB/s), io=56.5GiB (60.6GB), run=30001-30052msec
  WRITE: bw=1907MiB/s (1999MB/s), 50.6MiB/s-80.2MiB/s (53.1MB/s-84.1MB/s), io=55.9GiB (60.1GB), run=30001-30044msec
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
45112    46616059350565   C     81281024     0            80605184     0        675      46312283     219458       14457        42556757
45113    46616105662848   C     81805312     0            81190912     0        684      47335452     359614       21052        41163965
45114    46616152998300   C     81149952     0            80678912     0        679      46887698     12228973     39583        29266544
45115    46616199885998   C     57688064     0            57012224     0        489      12512219     8332132      20771395     42377446
45116    46616212398217   C     81805312     0            81141760     0        683      71568296     3383276      24245        36827690
45117    46616283966513   C     81149952     0            80539648     0        678      44743970     10766099     32691        31341262
45118    46616328710483   C     81149952     0            80527360     0        674      46646751     220928       21371        40313402
45119    46616375357234   C     81149952     0            80527360     0        676      45575267     10715816     20249        29692062
45120    46616420932501   C     81281024     0            80748544     0        680      44971173     236248       25576        42036086
45121    46616465903674   C     81149952     0            80637952     0        680      46693765     214532       25288        43491260
45122    46616512597439   C     81281024     0            80760832     0        676      48310498     96684        6341685      35546532
45123    46616560907937   C     81149952     0            80588800     0        682      46402936     207012       22675        43780165
45124    46616607310873   C     81281024     0            80703488     0        683      48462547     13665952     22716        30395226
45125    46616655773420   C     57294848     0            56705024     0        494      13924883     224172       30015892     39373770
45126    46616669698303   C     101728256    0            101126144    0        835      73702605     4462872      27321        41168582
45127    46616743400908   C     80887808     0            80199680     0        669      55312100     7756103      49028        35153178
45128    46616798713008   C     81281024     0            80719872     0        676      47073669     408426       12489        40645494
45129    46616845786677   C     81149952     0            80637952     0        681      46721243     2543877      106690       38293195
45130    46616892507920   S     0            0            0            0        0        46271377     8687127      33222        0
45131    46616938779297   O     0            0            0            0        0        0            0            0            0

Now I try to slow down the sync stage by adding gzip compression. I put buffer_compress_percentage=70 into the .fio file to make sure compression actually kicks in.

before patch compression=gzip

  WRITE: bw=542MiB/s (568MB/s), 13.6MiB/s-23.6MiB/s (14.2MB/s-24.8MB/s), io=16.1GiB (17.2GB), run=30270-30358msec
  WRITE: bw=496MiB/s (520MB/s), 11.0MiB/s-24.9MiB/s (12.5MB/s-26.1MB/s), io=14.6GiB (15.7GB), run=30176-30189msec
  WRITE: bw=507MiB/s (532MB/s), 12.8MiB/s-26.9MiB/s (13.4MB/s-28.2MB/s), io=14.0GiB (16.1GB), run=30182-30191msec
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
59754    62524099457587   C     81018880     0            25616384     0        680      563753583    604707       26596        305528904
59755    62524663211170   C     140787712    0            44441600     0        1136     306248761    3026303      22437        497137187
59756    62524969459931   C     81018880     0            25714688     0        700      500277256    156580       1447246      314366264
59757    62525469737187   C     151011328    0            47648768     0        1220     322599153    3290237      30574        526694584
59758    62525792336340   C     80887808     0            25722880     0        679      533332264    4337         57648        299709676
59759    62526325668604   C     141574144    0            44552192     0        1141     299859969    128281       5261         495275433
59760    62526625528573   C     91111424     0            28794880     0        766      498822704    3994640      31961        326383169
59761    62527124351277   C     131088384    0            41373696     0        1075     330498419    146105       11557        464895527
59762    62527454849696   C     91111424     0            29016064     0        783      465150176    150680       9680         344827363
59763    62527919999872   C     130957312    0            41590784     0        1088     345078704    145752       8869         467954698
59764    62528265078576   C     91111424     0            28917760     0        773      468198146    109856       21032        329253132
59765    62528733276722   C     133709824    0            42364928     0        1086     330848546    527090       53951        468862927
59766    62529064125268   C     88489984     0            27975680     0        745      469540196    152291       4839014      321392357
59767    62529533665464   C     133709824    0            42094592     0        1072     326470809    100641       23589        481451266
59768    62529860136273   C     88621056     0            28028928     0        753      481675727    153903       8546         329382950
59769    62530341812000   C     133709824    0            42168320     0        1083     329632969    109922       6349364      471987117
59770    62530671444969   C     92422144     0            29364224     0        758      479269914    95419        22266        338497154
59771    62531150714883   C     137510912    0            43356160     0        1116     340680633    59653        23560        479593707
59772    62531491395516   S     0            0            0            0        0        479781560    106079       24871        0
59773    62531971177076   O     0            0            0            0        0        0            0            0            0

after patch compression=gzip

  WRITE: bw=542MiB/s (568MB/s), 14.5MiB/s-22.0MiB/s (15.2MB/s-24.1MB/s), io=15.9GiB (17.1GB), run=30001-30026msec
  WRITE: bw=491MiB/s (515MB/s), 12.9MiB/s-23.8MiB/s (13.5MB/s-24.9MB/s), io=14.4GiB (15.5GB), run=30076-30087msec
  WRITE: bw=506MiB/s (531MB/s), 13.7MiB/s-22.8MiB/s (14.3MB/s-23.0MB/s), io=14.9GiB (15.0GB), run=30088-30103msec
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
60351    63292441072166   C     80625664     0            25665536     0        703      297278152    86551        286099131    293857052
60352    63292738350318   C     80625664     0            25432064     0        684      293177147    6684494      280244100    316954118
60353    63293031527465   C     80756736     0            25620480     0        692      298168675    2631230      303171202    289437707
60354    63293329696140   C     80756736     0            25620480     0        690      311718638    93003        283509921    292115530
60355    63293641414778   C     80756736     0            25546752     0        704      292827654    16694        282958337    291062908
60356    63293934242432   C     80756736     0            25473024     0        689      291490036    17600488     265024507    294677914
60357    63294225732468   C     81018880     0            25739264     0        695      290111552    147367       287134821    295117765
60358    63294515844020   C     80756736     0            25473024     0        690      293783917    40360        288662013    292821039
60359    63294809627937   C     80756736     0            25460736     0        679      294224677    37767        287350163    295239715
60360    63295103852614   C     80887808     0            25587712     0        706      292917223    279658       289516676    289092422
60361    63295396769837   C     81018880     0            25579520     0        693      296995015    883183       281102345    306474454
60362    63295693764852   C     80625664     0            25444352     0        668      287265507    65619        301226894    311818433
60363    63295981030359   C     80756736     0            25595904     0        696      319446207    9104382      284642452    303984653
60364    63296300476566   C     81281024     0            25833472     0        691      297675424    59914        300081102    291334517
60365    63296598151990   C     80887808     0            25661440     0        688      304402514    41802        287117406    290045658
60366    63296902554504   C     80756736     0            25620480     0        690      294605822    31144863     251541507    296751848
60367    63297197160326   C     80756736     0            25460736     0        684      287862885    39212        291608968    291407922
60368    63297485023211   S     0            0            0            0        0        302319539    717188       280107850    0
60369    63297787342750   W     0            0            0            0        0        285875039    98920        0            0
60370    63298073217789   O     0            0            0            0        0        0            0            0            0

A txg with "W" status is observed. However, there is no difference in the performance numbers.

Then I try lz4.
before patch compression=lz4

  WRITE: bw=2131MiB/s (2235MB/s), 58.0MiB/s-90.8MiB/s (60.9MB/s-95.2MB/s), io=62.5GiB (67.1GB), run=30001-30009msec
  WRITE: bw=2137MiB/s (2241MB/s), 58.3MiB/s-82.7MiB/s (61.1MB/s-86.7MB/s), io=62.7GiB (67.3GB), run=30001-30033msec
  WRITE: bw=2120MiB/s (2223MB/s), 57.3MiB/s-85.4MiB/s (60.1MB/s-89.6MB/s), io=62.2GiB (66.7GB), run=30001-30017msec
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
59497    62295821826121   C     156778496    0            54571008     0        735      53757120     1050416      23781        60254898
59498    62295875583241   C     80887808     0            28323840     0        410      64262346     133175       22957        52541648
59499    62295939845587   C     155992064    0            54206464     0        772      55745896     3385100      28723        63430302
59500    62295995591483   C     80756736     0            28360704     0        573      69832912     293554       40418        55024905
59501    62296065424395   C     154550272    0            53481472     0        718      58163148     4644850      27067        57527882
59502    62296123587543   C     80625664     0            28073984     0        452      64489090     47688        66250        51760139
59503    62296188076633   C     154419200    0            53702656     0        726      55520364     4505138      24659        53914306
59504    62296243596997   C     80887808     0            28303360     0        441      60844367     109522       12151        52066377
59505    62296304441364   C     155992064    0            54095872     0        765      55140233     66336        61905        60298365
59506    62296359581597   C     81412096     0            28467200     0        439      62989339     60392        18726        52740479
59507    62296422570936   C     149438464    0            51798016     0        933      52912442     2066965      668097       51618080
59508    62296475483378   C     81149952     0            28229632     0        511      55958937     115419       20038        44683512
59509    62296531442315   C     139870208    0            48553984     0        744      44923978     112406       7146         58676368
59510    62296576366293   C     103563264    0            35975168     0        601      63217211     12895231     30225        42121597
59511    62296639583504   C     142622720    0            49623040     0        663      60099689     5935466      29119        55191232
59512    62296699683193   C     90456064     0            31752192     0        612      63901173     4079559      32528        46513238
59513    62296763584366   C     136593408    0            47538176     0        846      51986673     109868       12668        57109625
59514    62296815571039   C     92815360     0            32604160     0        626      58741366     129464       25406        56746225
59515    62296874312405   S     0            0            0            0        0        61258339     5083757      24812        0
59516    62296935570744   O     0            0            0            0        0        0            0            0            0

after patch compression=lz4

  WRITE: bw=1923MiB/s (2017MB/s), 52.6MiB/s-78.2MiB/s (55.2MB/s-82.0MB/s), io=56.4GiB (60.5GB), run=30001-30015msec
  WRITE: bw=1955MiB/s (2050MB/s), 55.5MiB/s-74.6MiB/s (58.2MB/s-78.3MB/s), io=57.3GiB (61.5GB), run=30001-30019msec
  WRITE: bw=1945MiB/s (2039MB/s), 53.1MiB/s-76.9MiB/s (55.6MB/s-80.7MB/s), io=57.0GiB (61.2GB), run=30001-30014msec
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
56569    58662912459068   C     80887808     0            28545024     0        491      43956263     223894       38125337     40469058
56570    58662956415331   C     81018880     0            28479488     0        578      42542788     149993       36205815     40886703
56571    58662998958119   C     81412096     0            28651520     0        458      40867326     6941001      29516076     41789555
56572    58663039825445   C     80887808     0            28307456     0        516      41185442     111783       37033506     43187297
56573    58663081010887   C     81149952     0            28532736     0        612      41650415     18136308     20630497     54277163
56574    58663122661302   C     81018880     0            28413952     0        516      43880870     4484189      44775201     46459031
56575    58663166542172   C     81149952     0            28364800     0        497      54876440     1460653      39458765     41243889
56576    58663221418612   C     80887808     0            28237824     0        510      44953578     198680       37118188     41593651
56577    58663266372190   C     80887808     0            28262400     0        555      41440694     3278676      34275200     47494209
56578    58663307812884   C     81018880     0            28319744     0        671      42433606     2265634      40429803     45357353
56579    58663350246490   C     81018880     0            28229632     0        536      47684134     1078895      39393881     40493091
56580    58663397930624   C     81412096     0            28565504     0        476      44713983     179891       36148825     43762076
56581    58663442644607   C     81018880     0            28340224     0        536      40719026     2519254      36937317     41403806
56582    58663483363633   C     81018880     0            28356608     0        447      43673183     2848838      34422591     41548927
56583    58663527036816   C     81281024     0            28291072     0        417      41327437     244554       37329331     44122044
56584    58663568364253   C     80756736     0            28323840     0        483      43497819     9150281      29134688     45397085
56585    58663611862072   C     80887808     0            28381184     0        589      42898166     2343277      38538419     44733896
56586    58663654760238   S     0            0            0            0        0        45052412     74066        40572391     0
56587    58663699812650   W     0            0            0            0        0        45309404     733220       0            0
56588    58663745122054   O     0            0            0            0        0        0            0            0            0

Conclusion:
It seems that adding more txgs into the pipeline has a negative effect on throughput. (I tried a couple more times after this; the results are repeatable.)
One possible explanation: before the patch, the open txg would be stuck in the open state, growing until zfs_dirty_data_max is reached or the previous txg is synced, so it often grows much larger than zfs_dirty_data_sync_percent. Does a larger txg imply more performance?

Regarding the test with gzip, I believe it simply bottlenecks the whole pipeline, so the rest of the pipeline is irrelevant.

@behlendorf
Contributor

It seems that adding more txgs into the pipeline has a negative effect on throughput. (I tried a couple more times after this; the results are repeatable.)

I don't think this is too surprising since there's a non-trivial fixed cost per txg. What it might help us avoid, though, are latency spikes caused by bursty writes creating abnormally large transaction groups. By chance, is this something you looked at as well?

I accidentally find "abd scatter" has negative impact on performance by 20 percent.

The story here is complicated. It wouldn't surprise me if this is the case for a freshly booted system with a lot of free memory. However, when there isn't much free memory on the system, or it is heavily fragmented, is when you'll start to see problems caused by slower allocation times and even potentially OOM events. Stability and consistent performance were the main drivers for us switching to allocating scatter lists of pages.

How would you like to move forward with this?

@jxdking
Contributor Author

jxdking commented Apr 29, 2021

How would you like to move forward with this?

I am working on an improved version. Not only can it keep 3 active txgs in the pipeline, it will also try to maximize the size of each txg given the same zfs_dirty_data_max. In principle, that version should take care of the negative impact.
I would like to keep this pull request going, at least for now.

@behlendorf added the "Status: Work in Progress" label and removed the "Status: Code Review Needed" label Apr 29, 2021
@jxdking
Contributor Author

jxdking commented Apr 30, 2021

I don't think this is too surprising since there's a non-trivial fixed cost per txg. What it might help us avoid, though, are latency spikes caused by bursty writes creating abnormally large transaction groups. By chance, is this something you looked at as well?

In my experience, under normal use cases the latency issue can be mitigated by choosing a reasonable zfs_dirty_data_max.
However, I found one exception that may also cause abnormal latency: fsync. If you overwrite a small region of disk over and over, the dirty space never reaches zfs_dirty_data_sync_percent. The writes are cached in memory and not throttled, which can run at 2 GB/s. If the user then issues fsync within the txg timeout (about 5 s), it may cause huge I/O delays if the slog device cannot keep up.
I happen to have another pull request, #11912, about this issue.

@jxdking
Contributor Author

jxdking commented May 1, 2021

Here is the latest test, based on my 2nd commit, compared to upstream. I changed my test setup a little to get more consistent results.
I believe the performance degradation seen in the previous test has been fixed.
I set up an 8 GB file on a ramdisk on the host side instead of the original test on the guest side, then passed the file through as a block device into the KVM guest. Because of the passthrough overhead, the throughput numbers will be lower than in the previous test.

The VM has 4 GB of memory.
zfs_dirty_data_max is set to 256 MB
zfs_arc_min is set to 1 GB
zfs_abd_scatter_enabled is set to 0
Other parameters remain at their defaults

zpool is created with

zpool create -o ashift=12 -O primarycache=metadata test /dev/vdb
zfs set compression=lz4 test

.fio file

; fio-seq-write.job for fiotest

[global]
name=fio-seq-write
filename=/test/fio-seq-write
rw=write
bs=128K
direct=1
numjobs=1
time_based=1
runtime=30
buffer_compress_percentage=30

[file1]
size=4096M
ioengine=libaio
iodepth=16

After patch

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
35679    69092572632647   C     80756736     0            57958400     0        646      40669578     43437        20614115     44124355
35680    69092613302225   C     80756736     0            57802752     0        522      47436725     68818        17283517     43646520
35681    69092660738950   C     56377344     0            40411136     0        328      58109274     21868        2872966      37731333
35682    69092718848224   C     80756736     0            58036224     0        615      35292948     44108        5294440      47980497
35683    69092754141172   C     75120640     0            53981184     0        613      50460510     31963        2834690      50602716
35684    69092804601682   C     87310336     0            62488576     0        677      44603987     4861175      4013901      44782167
35685    69092849205669   C     80756736     0            57933824     0        664      42186732     53151        11423956     53068521
35686    69092891392401   C     80756736     0            57847808     0        566      35091360     36302        29431026     46645059
35687    69092926483761   C     80756736     0            58060800     0        684      47961384     50720        28109134     47815868
35688    69092974445145   C     80756736     0            57933824     0        633      51824890     32915        24130930     50572332
35689    69093026270035   C     81018880     0            58187776     0        619      52857654     3141         21892758     51751870
35690    69093079127689   C     80756736     0            57823232     0        640      54874669     39900        18739796     57640744
35691    69093134002358   C     80756736     0            57958400     0        653      44241695     36307        32153019     52333859
35692    69093178244053   C     80756736     0            57974784     0        475      54879156     52565        29602071     36515528
35693    69093233123209   C     57819136     0            41521152     0        454      63174331     21161        2981065      40224102
35694    69093296297540   C     80756736     0            57790464     0        555      33685557     101633       9444723      46084255
35695    69093329983097   C     57294848     0            40947712     0        489      54311096     2533         1322765      39443299
35696    69093384294193   C     80756736     0            57950208     0        621      29815052     93420        10865597     44708670
35697    69093414109245   C     80756736     0            58032128     0        588      33851443     23461        21803430     44525663
35698    69093447960688   S     0            0            0            0        0        62570654     24644        3762074      0
35699    69093510531342   W     0            0            0            0        0        35269108     25155        0            0
35700    69093545800450   O     0            0            0            0        0        0            0            0            0

A txg with "W" status is observed.

I did 4 batches of runs in total.

after patch

  WRITE: bw=1519MiB/s (1593MB/s), 1519MiB/s-1519MiB/s (1593MB/s-1593MB/s), io=44.5GiB (47.8GB), run=30011-30011
  WRITE: bw=1540MiB/s (1615MB/s), 1540MiB/s-1540MiB/s (1615MB/s-1615MB/s), io=45.1GiB (48.5GB), run=30007-30007
  WRITE: bw=1523MiB/s (1597MB/s), 1523MiB/s-1523MiB/s (1597MB/s-1597MB/s), io=44.6GiB (47.9GB), run=30001-30001

before patch

  WRITE: bw=1580MiB/s (1657MB/s), 1580MiB/s-1580MiB/s (1657MB/s-1657MB/s), io=46.3GiB (49.7GB), run=30001-30001
  WRITE: bw=1549MiB/s (1624MB/s), 1549MiB/s-1549MiB/s (1624MB/s-1624MB/s), io=45.4GiB (48.7GB), run=30001-30001
  WRITE: bw=1553MiB/s (1629MB/s), 1553MiB/s-1553MiB/s (1629MB/s-1629MB/s), io=45.5GiB (48.9GB), run=30001-30001

after patch

  WRITE: bw=1532MiB/s (1606MB/s), 1532MiB/s-1532MiB/s (1606MB/s-1606MB/s), io=44.9GiB (48.2GB), run=30001-30001
  WRITE: bw=1557MiB/s (1632MB/s), 1557MiB/s-1557MiB/s (1632MB/s-1632MB/s), io=45.6GiB (48.0GB), run=30002-30002
  WRITE: bw=1564MiB/s (1640MB/s), 1564MiB/s-1564MiB/s (1640MB/s-1640MB/s), io=45.8GiB (49.2GB), run=30001-30001

before patch

  WRITE: bw=1522MiB/s (1596MB/s), 1522MiB/s-1522MiB/s (1596MB/s-1596MB/s), io=44.6GiB (47.9GB), run=30006-30006
  WRITE: bw=1518MiB/s (1592MB/s), 1518MiB/s-1518MiB/s (1592MB/s-1592MB/s), io=44.5GiB (47.8GB), run=30001-30001
  WRITE: bw=1533MiB/s (1607MB/s), 1533MiB/s-1533MiB/s (1607MB/s-1607MB/s), io=44.9GiB (48.2GB), run=30008-30008

From the numbers above, I can say they are about equal.
With this patch, I keep the average txg size at least 95% as big as in the vanilla implementation; the remaining 5% is enough to partially squeeze in an extra txg. I also literally counted, by hand, the number of txgs consumed before and after the patch, and they are about equal during the fio run. Thus performance is expected to be equal if not better.

I also did some tests after putting "iotune" (for I/O throttling) on the block device passed through to KVM. All the results converge to the same value within 1 MB/s deviation, which just proves the throttle is doing its job. I don't think those runs help in my case, so I did not include the results here.

Since all the txgs have the same size, it does help smooth out write operations a lot.

@jxdking requested a review from behlendorf May 3, 2021 14:26
@jxdking
Contributor Author

jxdking commented May 9, 2021

The CentOS Stream 8 x86_64 (TEST) build slave failed at:

[ 9994.053914] VERIFY3(spa->spa_checkpoint_info.sci_dspace == ckpoint_sm_space_sum) failed (186398720 == 378475008)
[ 9994.055615] PANIC at spa_checkpoint.c:276:spa_checkpoint_accounting_verify()

I am not sure if it is related to this pull request.

@amotin
Member

amotin commented May 12, 2021

I like using dp_dirty_pertxg[] as the min sync threshold instead of dp_dirty_total. The dp_dirty_total use in txg_sync_thread() is OK only because there is no syncing or quiesced TXG at that point, so dp_dirty_total is effectively equal to dp_dirty_pertxg[]. In dsl_pool_need_dirty_delay() that is not true, and it is indeed worth fixing.

I also like the idea of preemptive quiescing. But looking at almost 30% of wtime (if I read it correctly), I think the specific math is broken. To reach maximum performance, wtime should be minimal. On one side, dirty data belonging to the waiting TXG is not being synced yet and so is not amortizing metadata updates. On the other, it also can't be aggregated with new writes coming into the next open TXG. It is a waste from all sides. I think your problem is in the 48% constant. Since IIRC dp_dirty_total is reduced as each block write completes, you should not start quiescing when the new transaction reaches 48% of the peak: it does not mean that the previous syncing TXG is almost committed, but rather that it still has an amount of work to do equal to the amount in the open TXG, and if you quiesce now, the TXG will have to wait.

Ideal would be some self-adapting mechanism, quiescing the new transaction when the previous sync is getting close to completion, so that quiescing completes just in time for it. The easiest would be to bump your 49% to somewhere around 90-95%, measured not from the peak but from dp_dirty_total. But better, I think, would be to make the threshold increase each time a TXG has to wait and decrease each time the sync thread has to wait for work.
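
A rough sketch of such a feedback loop, in the spirit of the suggestion above (names and constants are hypothetical, not from the patch):

/*
 * Hypothetical self-adapting quiesce threshold, as a percentage of
 * dp_dirty_total.  Raise it when a quiesced txg had to wait on the
 * sync thread (we quiesced too early); lower it when the sync thread
 * had to wait for work (we quiesced too late).
 */
static uint64_t quiesce_pct = 90;

static void
quiesce_pct_adapt(boolean_t txg_waited, boolean_t sync_thread_starved)
{
	if (txg_waited && quiesce_pct < 99)
		quiesce_pct++;		/* quiesce later next time */
	else if (sync_thread_starved && quiesce_pct > 50)
		quiesce_pct--;		/* quiesce earlier next time */
}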

@amotin self-requested a review May 12, 2021 21:29
@jxdking
Contributor Author

jxdking commented May 13, 2021

Since IIRC dp_dirty_total is reduced as each block write completes, you should not start quiescing when the new transaction reaches 48% of the peak: it does not mean that the previous syncing TXG is almost committed, but rather that it still has an amount of work to do equal to the amount in the open TXG, and if you quiesce now, the TXG will have to wait.

I just did some tests. It is true that dp_dirty_total is reduced as each block write completes.

Ideal would be some self-adapting mechanism, quiescing the new transaction when the previous sync is getting close to completion, so that quiescing completes just in time for it. The easiest would be to bump your 49% to somewhere around 90-95%, measured not from the peak but from dp_dirty_total. But better, I think, would be to make the threshold increase each time a TXG has to wait and decrease each time the sync thread has to wait for work.

Thank you for your wonderful analysis.
In theory, 90-95% of the peak value should work.
To verify this, I increased it from 48% to 55%. See the results below.

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
175228   639992657529006  C     88883200     0            63811584     0        147      730511301    1944         48814        636039963
175229   639993388040307  C     97533952     0            70033408     0        150      73487156     1900         562623368    696947897
175230   639993461527463  C     93077504     0            66801664     0        145      1259601552   1588         4407969      670666142
175231   639994721129015  C     97533952     0            69980160     0        127      675090916    1895         13802        699094297
175232   639995396219931  C     93077504     0            66818048     0        141      699124773    1796         12220        668390812
175233   639996095344704  C     97533952     0            69992448     0        144      668450386    3069         55495        701103186
175234   639996763795090  C     93208576     0            66883584     0        139      701177058    62445        13914        667921150
175235   639997464972148  C     97009664     0            69705728     0        142      69438844     1936         598580995    697326573
175236   639997534410992  C     100679680    0            72155136     0        148      1299152263   8157         32047        712104026
175237   639998833563255  C     98320384     0            70606848     0        149      716013315    2158         52718        706274753
175238   639999549576570  C     92028928     0            66048000     0        146      706363871    1756         16623        662713913
175239   640000255940441  C     97665024     0            70115328     0        143      71044598     134969       591569599    693088396
175240   640000326985039  C     92946432     0            66707456     0        153      1284808543   5896         8703         672417051
175241   640001611793582  C     97665024     0            70086656     0        138      672454387    1897         8659         701585716
175242   640002284247969  C     92946432     0            66723840     0        147      701615445    2645         15924        666385440
175243   640002985863414  C     100679680    0            72290304     0        142      667681067    3388         30788        725934199
175244   640003653544481  C     89931776     0            64565248     0        141      725982977    2152         31483        640424032
175245   640004379527458  C     97140736     0            69722112     0        150      68105377     1794         572370525    698288949
175246   640004447632835  C     93470720     0            67076096     0        143      1270683632   2705         20113        671316514
175247   640005718316467  C     97271808     0            69828608     0        150      671375409    58159        30173        696761294
175248   640006389691876  C     93470720     0            67084288     0        145      696871322    85294        32266        670010312
175249   640007086563198  C     97140736     0            69713920     0        147      670145485    1947         15488        701376097
175250   640007756708683  C     93601792     0            67264512     0        140      701430659    47362        17639        669679903
175251   640008458139342  C     97140736     0            69611520     0        145      669828662    22301        17254        698234887
175252   640009127968004  C     93601792     0            67219456     0        146      698297046    64492        19134        670141928
175253   640009826265050  C     97140736     0            69722112     0        147      670257019    85959        27259        698071692
175254   640010496522069  C     93470720     0            67096576     0        136      698199954    1586         14777        669633049
175255   640011194722023  C     99631104     0            71524352     0        142      670862388    1923         28999        715478313
175256   640011865584411  S     0            0            0            0        0        716002427    2868         46705        0
175257   640012581586838  W     0            0            0            0        0        70924057     2243         0            0
175258   640012652510895  O     0            0            0            0        0        0            0            0            0
root@dev:~#

From the test, 55% cannot reliably cause enough wtime. Even before this patch, the size of the open txg is coupled with the size of the previous syncing txg (see the "before patch" results in my previous tests).
It should not be this way if dp_dirty_total is reduced as each block write completes during syncing; the open txg should be able to grow much bigger.

The test results do not make sense to me. I am going to investigate this issue. 90-95% is a good target to work on.
PS: zfs_dirty_data_max is set to 256 MB.

@jxdking marked this pull request as draft May 13, 2021 01:05
@amotin
Member

amotin commented May 13, 2021

The test results do not make sense to me. I am going to investigate this issue. 90-95% is a good target to work on.
PS: zfs_dirty_data_max is set to 256 MB.

To make the results more meaningful, could you please test this in a more traditional and controllable environment: real hardware instead of a VM, a sufficient amount of memory (at least 8 GB), a sufficient number of CPU cores (I'd prefer my patch to reduce starvation and scheduler effects: #11966), no weird custom tunings, real HDDs and/or SSDs instead of RAM disks with unrealistic characteristics, etc.

@amotin
Member

amotin commented May 13, 2021

In theory, 90-95% of the peak value should work.
To verify this, I increased it from 48% to 55%. See the results below.

Your peak value should grow immediately, while decaying very slowly over dozens or hundreds of TXGs. In such a scenario 95% would almost never be reached, while 48% or 55% may just fall in the middle of the range and be somewhat effective, but, judging by the large wtime, suboptimal. That is why I proposed measuring the percentage from dp_dirty_total. It would also make the function continuous with "if (dirty == total)" if the percentage is increased to 100%.
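
One way to read that proposal, as a hedged sketch (the predicate name is hypothetical): quiesce only once the open txg holds nearly all of dp_dirty_total, i.e. the syncing txg has almost drained its share:

/*
 * Hypothetical predicate: kick when the open txg accounts for at least
 * pct percent of dp_dirty_total, meaning the syncing txg holds less
 * than (100 - pct) percent of the remaining dirty data.  At pct = 100
 * this degenerates to (dirty == total), hence "continuous".
 */
static boolean_t
txg_sync_nearly_done(dsl_pool_t *dp, uint64_t txg, uint64_t pct)
{
	uint64_t dirty = dp->dp_dirty_pertxg[txg & TXG_MASK];

	return (dirty * 100 >= dp->dp_dirty_total * pct);
}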

@jxdking
Contributor Author

jxdking commented May 13, 2021

That is why I proposed measuring the percentage from dp_dirty_total. It would also make the function continuous with "if (dirty == total)" if the percentage is increased to 100%.

I will switch to dp_dirty_total instead of the "peak", just as you proposed. It should work.
Just as you mentioned, I found my test environment has a limitation that causes dsl_dir_tempreserve_space to fail; that is why the txg size never grows big.
I will post the new results back after I fix my test environment.

@jxdking
Contributor Author

jxdking commented May 13, 2021

Your peak value should grow immediately, while decaying very slowly over dozens or hundreds of TXGs. In such a scenario 95% would almost never be reached, while 48% or 55% may just fall in the middle of the range and be somewhat effective, but, judging by the large wtime, suboptimal. That is why I proposed measuring the percentage from dp_dirty_total. It would also make the function continuous with "if (dirty == total)" if the percentage is increased to 100%.

I have updated the code and force-pushed the 2nd commit. I use 95% of dp_dirty_total, and it works.
I fixed my test environment and re-tested.
The wtime is about 5% of stime, which is expected. The txg size ramps up quickly once fio starts running.

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
1565     5496303431850    C     0            0            0            0        0        5119959415   3724         147363       177040
1566     5501423391265    C     0            0            0            0        0        5119963671   2472         60575        135343
1567     5506543354936    C     53886976     94208        39202816     3        316      2860208257   2198         65569        297276216
1568     5509403563193    C     180240384    45056        129978368    1        313      224567505    68088        72719205     1299854320
1569     5509628130698    C     208814080    90112        150659072    2        345      1290474137   114683       82060832     1507674055
1570     5510918604835    C     208683008    45056        150470656    1        360      1502925069   114708       86833356     1505566546
1571     5512421529904    C     207765504    49152        149929984    2        367      1500377420   108795       92036122     1499341693
1572     5513921907324    C     209076224    0            150888448    0        382      1507714805   55855        83723656     1509124667
1573     5515429622129    C     208551936    0            150274048    0        348      1504038022   120483       88754240     1502270760
1574     5516933660151    C     208945152    0            150593536    0        336      1509271676   148127       81732913     1506392139
1575     5518442931827    C     208420864    0            150220800    0        361      1503087396   112930       85079606     1501720665
1576     5519946019223    C     208551936    0            150286336    0        347      1500980139   133760       85807463     1502374394
1577     5521446999362    C     207896576    0            149979136    0        365      1500125236   43547        88156854     1500720894
1578     5522947124598    C     208420864    0            150278144    0        349      1502709352   97334        86129509     1503328939
1579     5524449833950    C     208945152    0            150941696    0        405      1502723116   111111       86740203     1508737206
1580     5525952557066    C     207896576    0            150036480    0        385      1505874501   124159       89597580     1500005841
1581     5527458431567    C     208027648    0            150036480    0        345      1498451865   98345        91194253     1499577541
1582     5528956883432    C     208683008    8192         150700032    2        403      1507652237   187775       83036844     1508258005
1583     5530464535669    C     208420864    0            150503424    0        381      1503345262   80259        88073051     1505072490
1584     5531967880931    C     207503360    4096         149741568    1        372      1500752600   130817       92352989     1497138929
1585     5533468633531    C     208551936    0            150364160    0        337      1504274294   94290        85265121     1503351203
1586     5534972907825    C     208683008    0            150638592    0        352      1502115981   113372       86490165     1506278306
1587     5536475023806    C     208420864    0            150523904    0        361      1505932533   221217       86740773     1505880910
1588     5537980956339    C     208158720    0            150204416    0        359      1502386943   87400        90378672     1501457737
1589     5539483343282    C     208420864    0            150634496    0        374      1504677135   46652        87208316     1506916697
1590     5540988020417    C     209076224    0            150999040    0        369      1507227127   91725        86860868     1509600687
1591     5542495247544    C     208289792    0            150343680    0        371      1506282721   86007        90197136     1503223800
1592     5544001530265    S     0            0            0            0        0        1509079993   116448       84317422     0
1593     5545510610258    O     0            0            0            0        0        0            0            0            0
root@dev:~#

@amotin
Member

amotin commented May 13, 2021

5% wtime is definitely better than 30%, though it is still a few orders of magnitude bigger than qtime. I am thinking about whether we could get closer. I am also curious how this would coexist with scrub, which runs in the sync thread but does not touch the dirty data counters (will it inflate wtime with this patch?).

@amotin
Member

amotin commented May 13, 2021

Have you actually seen qtime reach significant times, where the promised "improved write throughput" would become measurable? I am not opposing, just trying to better understand the scenarios and how big the time reserve should be.

@jxdking
Contributor Author

jxdking commented May 14, 2021

Have you actually seen qtime reach significant times, where the promised "improved write throughput" would become measurable? I am not opposing, just trying to better understand the scenarios and how big the time reserve should be.

I rarely see qtime reaching 1 ms. Assuming a syncing time of 1 s, you would get at best a 0.1% improvement, in theory.
However, in reality I found that txg size matters more. This patch trades a reduced average txg size for a little bit of qtime, which does not look good.

@amotin
Member

amotin commented May 14, 2021

While recently working on prefetch, I noticed that the write path does additional reads before opening a transaction (see dmu_tx_check_ioerr()). I suspect it may not always work that way, in which case txg quiescing could take more time while all the cache misses are resolved with synchronous reads. I wonder whether there are still some cases not covered by such prefetch logic, where a transaction may need more time between assign and commit. And if there are, can we predict that, so as to start quiescing reasonably in advance?
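
If it helps to picture the window being described, here is a minimal sketch of that write path using the public DMU transaction API; example_write() is a hypothetical helper, and error handling is trimmed:

static int
example_write(objset_t *os, uint64_t object, uint64_t off, int len, const void *buf)
{
	dmu_tx_t *tx = dmu_tx_create(os);
	int error;

	/*
	 * The hold runs before the transaction is assigned to a txg.
	 * Internally it calls dmu_tx_check_ioerr(), which reads the
	 * metadata the write will touch, so those cache misses are
	 * resolved here rather than between assign and commit,
	 * keeping qtime small.
	 */
	dmu_tx_hold_write(tx, object, off, len);

	error = dmu_tx_assign(tx, TXG_WAIT);	/* joins the open txg here */
	if (error != 0) {
		dmu_tx_abort(tx);
		return (error);
	}

	dmu_write(os, object, off, len, buf, tx);	/* dirties data in the open txg */
	dmu_tx_commit(tx);	/* quiescing must wait for this window to close */
	return (0);
}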

@jxdking
Contributor Author

jxdking commented May 15, 2021

I just force-pushed the 2nd commit. It adds a new module parameter, zfs_txg_quiesce_advance, to control the
timing of quiescing the open txg, so wtime can be tuned more precisely. See the commit message of 106f767 for more detail.
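
To make the knob concrete, here is a rough sketch of the decision it adds. The helper name is mine and this is not the literal PR diff, just the logic described in the commit message (dp_dirty_pertxg[] and TXG_MASK are existing dsl_pool/txg names):

/*
 * May the open txg be quiesced while "syncing_txg" is still syncing?
 * Callers are assumed to hold dp_lock for the dirty counters.
 */
static boolean_t
txg_quiesce_advance_ok(dsl_pool_t *dp, uint64_t syncing_txg)
{
	uint64_t sync_dirty = dp->dp_dirty_pertxg[syncing_txg & TXG_MASK];

	/*
	 * With the default of 0, this only passes once the syncing txg
	 * has no dirty data left, i.e. the sync is wrapping up.
	 */
	return (sync_dirty <= zfs_txg_quiesce_advance);
}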

Below is a quick test with the default value zfs_txg_quiesce_advance = 0, run with fio doing random writes.
The default value 0 yields the same average txg size as without this patch, while minimizing the idle time of the sync stage.
Increasing zfs_txg_quiesce_advance quiesces the open txg earlier, reducing the average txg size in exchange for more wtime.

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
21726    103206225664986  C     0            245760       1519616      29       102      633218       795          10791        17687377
21727    103206226298204  C     0            0            0            0        0        24079202     1612         10784        77452
21728    103206250377406  C     0            0            1605632      0        105      6145616      1579         32674        8932632
21729    103206256523022  C     0            0            0            0        0        18879744     2748         40313        90109
21730    103206275402766  C     0            0            778240       0        58       5254805809   2917         122647       11993657
21731    103211530208575  C     57294848     20480        41906176     4        357      1756587977   4992557      19357        321394414
21732    103213286796552  C     221003776    0            159694848    0        381      315266593    2576         11146203     1596576140
21733    103213602063145  C     225984512    0            163201024    0        393      1596079432   2517         11651015     1633176640
21734    103215198142577  C     225984512    0            163164160    0        341      1632792692   2102         12046204     1632454902
21735    103216830935269  C     228605952    0            165183488    0        436      1644532751   74407        37124        1651635616
21736    103218475468020  C     226508800    0            163627008    0        454      1651943777   21304        42311        1635996946
21737    103220127411797  C     225460224    24576        162725888    2        402      1623015694   3171         13051151     1627796546
21738    103221750427491  C     225067008    0            162455552    0        417      1628279032   2409         12580216     1624712731
21739    103223378706523  C     226770944    0            163827712    0        382      1624882610   3176         12426656     1638119133
21740    103225003589133  C     228737024    0            165191680    0        357      1651203408   1301         33939        1651041132
21741    103226654792541  C     226377728    0            163581952    0        396      1651254195   1186885      38175        1635159934
21742    103228306046736  C     226115584    0            163246080    0        420      1619725232   135801       16539105     1631976661
21743    103229925771968  C     224673792    0            162148352    0        396      1635030035   3047         13630285     1621877783
21744    103231560802003  C     225460224    0            162836480    0        390      1621542461   4388         13975316     1627243231
21745    103233182344464  S     0            0            0            0        0        1629081387   2957         12151616     0
21746    103234811425851  O     0            0            0            0        0        0            0            0            0

@jxdking
Contributor Author

jxdking commented May 15, 2021

While recently working on prefetch, I noticed that the write path does additional reads before opening a transaction (see dmu_tx_check_ioerr()). I suspect it may not always work that way, in which case txg quiescing could take more time while all the cache misses are resolved with synchronous reads. I wonder whether there are still some cases not covered by such prefetch logic, where a transaction may need more time between assign and commit. And if there are, can we predict that, so as to start quiescing reasonably in advance?

In my latest test, the first txg (21726) is the txg right after the pool is imported.
On the first txg under fio I see almost 5 ms of qtime. Is this the case you were talking about?

@amotin
Member

amotin commented May 15, 2021

The first txg under fio I see almost 5ms qtime. Is this the one you were talking about?

5 ms is obviously a spike, but I am not sure what causes it. As I said, zfs_write() via dmu_tx_hold_write_by_dnode() calls dmu_tx_check_ioerr() before opening the transaction, which moves the metadata reads out of qtime and earlier. I suspect that if you add "return (0);" to dmu_tx_check_ioerr() you would see a huge qtime increase. What remains in qtime now, though, I am not sure. It could be some memory allocation or some metadata access not covered by the above code, or the thread that opened the transaction could be context-switched before committing it. Or, theoretically, there could be some swap-in of the data being written, if it is somehow not resident. But all of those are pretty small and too unpredictable to take into account.

@jxdking
Contributor Author

jxdking commented May 16, 2021

This patch may introduce some race conditions.
For example, in txg_sync_thread() there is a place that accesses tx->tx_open_txg without holding tx->tx_cpu[].tc_open_lock. If syncing and quiescing are allowed to run at the same time, there is a chance this code breaks.
Such issues may be self-recoverable, so you would never notice them; in the worst case they may cause a deadlock that is hard to reproduce in testing.
I am working on these issues now. For illustration, the shape of the fix is sketched below.
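
The idea is to read tx_open_txg only under the per-CPU open locks, the same way txg_quiesce() serializes against transaction openers; the helper name here is mine, not the code from the force-pushed commit:

static uint64_t
txg_open_txg_read(tx_state_t *tx)
{
	uint64_t open_txg;
	int c;

	/* tc_open_lock on every CPU is what protects tx_open_txg */
	for (c = 0; c < max_ncpus; c++)
		mutex_enter(&tx->tx_cpu[c].tc_open_lock);
	open_txg = tx->tx_open_txg;
	for (c = 0; c < max_ncpus; c++)
		mutex_exit(&tx->tx_cpu[c].tc_open_lock);

	return (open_txg);
}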

Added a new module parameter zfs_txg_quiesce_advance to control the
timing of quiescing the open txg.

The open txg can be quiesced into the pipeline even if there is a txg
still syncing. When the dirty data in the syncing txg drops below
zfs_txg_quiesce_advance, which means the sync is about to complete,
the open txg is quiesced into the pipeline. A value of 0 means the
open txg is only quiesced once all the data in the previous txg has
synced. Note that even after all dirty data is written out, the sync
thread may not return immediately (it may be blocked on a mutex), so
even a value of 0 helps minimize the idle time of the sync stage.
Recommended value: 0 to 5% of zfs_dirty_data_max, in bytes.

I also moved the txg_kick() call into dsl_pool_dirty_space() and
dsl_pool_undirty_space(), so that txg_kick() can also be triggered by
the undirty routines.
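
As a sketch of what that hook might look like (the helper name is guessed from the description; at the time of this PR txg_kick() takes just the pool, and zfs_dirty_data_sync_percent is the existing sync threshold):

static void
dsl_pool_dirty_kick(dsl_pool_t *dp, uint64_t txg)
{
	boolean_t kick;

	mutex_enter(&dp->dp_lock);
	/* per-txg dirty bytes, not dp_dirty_total, which only drops after sync */
	kick = (dp->dp_dirty_pertxg[txg & TXG_MASK] >
	    zfs_dirty_data_max * zfs_dirty_data_sync_percent / 100);
	mutex_exit(&dp->dp_lock);

	if (kick)
		txg_kick(dp);
}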

Addressed the race condition on tx->tx_open_txg in txg_sync_thread().

Some other style fixes from code review.

Signed-off-by: jxdking <[email protected]>
@jxdking
Contributor Author

jxdking commented May 17, 2021

I have patched up the race condition on tx_open_txg in txg_sync_thread() and force-pushed commit 6b40c56.

@jxdking jxdking marked this pull request as ready for review May 18, 2021 02:37
@jxdking
Contributor Author

jxdking commented May 18, 2021

So far, I have not been able to verify the improved-throughput claim in my test environment.
The test runs on the GitHub build servers seem to be faster, but maybe I just got lucky on those runs.

@jxdking
Contributor Author

jxdking commented Jun 21, 2021

In conclusion:
There is a fixed overhead to syncing a single txg regardless of its size, such as syncing out the metaslabs, flushing all the vdevs, and writing out the uberblock. From my tests, those costs are not small; sometimes it takes seconds to flush a drive with a 512 MB cache.
Basically, you want each txg to be as large as possible, so that you minimize the number of txgs you need. The current txg_kick() allows only 2 txgs across the open, quiescing, and syncing states, which naturally produces the largest possible txgs, so performance is already about optimal.
Thus, I am closing this pull request.
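
To put rough numbers on that (illustrative figures only): if each txg pays a fixed ~0.5 s for the metaslab/vdev-flush/uberblock work and spends 1.5 s writing data, the fixed overhead is 25% of the pipeline time. Halving the txg size makes the same data take two txgs of ~0.75 s of writing plus 0.5 s of overhead each, i.e. 2.5 s instead of 2.0 s, pushing the fixed overhead to 40%. Fewer, larger txgs amortize the fixed cost better.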
