TXG syncing non-stop #3811

DeHackEd · 2015-09-21T19:03:01Z

I have a system that has for reasons not fully understood found itself caught in an infinite transaction commit loop. There is never a transaction not flushing. Eg:

# cat /proc/spl/kstat/zfs/anima/txgs 
18 0 0x01 50 5600 629359807895 200790103637743
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
31601021 200695506940447  C     6530560      0            9462784      0        1188     2060090840   1693         2174040557   2186188720  
31601022 200697567031287  C     2827776      0            5055488      0        847      2174044329   1693         2186188616   1446891910  
31601023 200699741075616  C     4892160      0            5602304      0        894      2186261263   2109         1446822410   1867058923  
31601024 200701927336879  C     4882432      0            5030912      0        671      1446861787   1034         1867024714   581889825   
31601025 200703374198666  C     2532864      0            4899840      0        810      1867030869   2280         581885839    2743033451  
31601026 200705241229535  C     7454720      0            9946112      0        1300     581950964    1782         2742971195   1991405545  
31601027 200705823180499  C     2795008      0            5264384      0        826      2743033419   1843         1991346040   1578566011  
31601028 200708566213918  C     2287104      0            3949568      0        599      1991550428   1280         1578365303   1266690493  
31601029 200710557764346  C     2975232      0            3630080      0        587      1578368196   1172         1266691368   988292375   
31601030 200712136132542  C     2719744      0            4353024      0        695      1266732858   1257         988253365    1269193178  
31601031 200713402865400  C     2064384      0            4301824      0        712      988259215    1845         1269189260   1318036710  
31601032 200714391124615  C     3433984      0            4379648      0        710      1269207331   1790         1318021910   1276174935  
31601033 200715660331946  C     6104576      32768        12534784     2        1478     1318030744   1621         1276168950   3632021982  
31601034 200716978362690  C     1605632      0            4946944      0        828      1276171973   975          3632023318   2550495221  
31601035 200718254534663  C     4646400      0            6333440      0        915      3632026052   1295         2550495482   1961576884  
31601036 200721886560715  C     3335680      24576        8240128      2        1043     2550745052   1220         1961331618   1919329677  
31601037 200724437305767  C     6546944      0            12438528     0        1371     1961334764   1550         1919329113   2458912542  
31601038 200726398640531  C     4105728      0            6460416      0        1113     1919336123   2162         2458908463   1871424461  
31601039 200728317976654  C     3122688      0            9260032      0        1295     2458945867   1165         1871391244   2432058308  
31601040 200730776922521  C     2139648      16384        4363264      1        733      1871631891   1175         2431821191   1705877376  
31601041 200732648554412  C     7431680      20480        6142976      2        868      2431824378   1460         1705877027   2235876248  
31601042 200735080378790  C     6373376      0            9828864      0        1226     1705946214   2143         2235809634   957615740   
31601043 200736786325004  C     3132928      16384        5497856      1        907      2235813597   1205         957616802    940364717   
31601044 200739022138601  C     3293184      0            7511040      0        913      957620722    1470         940364182    1586772330  
31601045 200739979759323  C     2090496      0            4674560      0        835      940395527    1675         1586743560   2537173741  
31601046 200740920154850  C     2490368      16384        4256768      1        634      1586774251   1708         2537145699   1106012232  
31601047 200742506929101  C     5203456      32768        5237760      2        780      2537151973   1528         1106008861   1303362471  
31601049 200746150113764  C     2514432      0            5331968      0        880      1303378400   1112         1986463108   3405061932  
31601050 200747453492164  C     4023808      0            5153792      0        862      1986465933   1357         3405061727   1647207073  
31601051 200749439958097  C     3204608      2048         9628672      1        1244     3405093810   1000         1647178170   1903098231  
31601052 200752845051907  C     2991616      0            5417984      0        906      1647203714   1805         1903075466   574683529   
31601053 200754492255621  C     9070080      24576        13080576     4        1567     1903079564   1247         574683253    5249013995  
31601054 200756395335185  C     1933312      69632        3879936      1        669      574686238    980          5249014324   1101214148  
31601055 200756970021423  C     4991488      110592       6807040      9        1013     5249067830   1722         1101162765   999540777   
31601056 200762219089253  C     6481408      8192         5721088      1        853      1101168415   1497         999539055    1994664490  
31601057 200763320257668  C     1935360      0            3758080      0        615      999570580    1726         1994635601   1415369829  
31601058 200764319828248  C     8562688      4096         10460160     1        1349     1994640151   1705         1415368135   2576245158  
31601059 200766314468399  C     3686400      90112        4756480      6        757      1415443156   1747         2576172913   1920723968  
31601060 200767729911555  C     3597824      83968        5695488      6        901      2576229843   1608         1920669815   3201070693  
31601061 200770306141398  C     4187648      32768        7268864      2        1117     1920740234   1754         3201002596   2297082345  
31601062 200772226881632  C     8072704      0            12634624     0        1491     3210611176   1420         2287477428   1069668004  
31601063 200775437492808  C     3211264      4096         5731328      1        908      2287480654   1389         1069667405   3295683394  
31601064 200777724973462  C     2270720      0            3572736      0        617      1073754586   1425         3291598824   1322553790  
31601065 200778798728048  C     5237248      0            8184832      0        1038     3306479065   1495         1307676399   1714571077  
31601066 200782105207113  C     2125824      0            4842496      0        815      1307715185   1282         1714536599   2582946382  
31601067 200783412922298  C     2270720      0            4051968      0        667      1714539763   1343         2582945826   1413471725  
31601068 200785127462061  S     0            0            0            0        0        2593295003   1278         1403125447   0           
31601069 200787720757064  W     0            0            0            0        0        1403187929   1631         0            0           
31601070 200789123944993  O     0            0            0            0        0        0            0            0            0           

# zpool iostat -v 5 5
                                                       capacity     operations    bandwidth
pool                                                alloc   free   read  write   read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
anima                                               1005G  3.05T      3    420  44.2K  2.49M
  raidz1                                            1005G  3.05T      3    420  44.2K  2.49M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRY2J8-part3      -      -      1    259  16.0K  1.51M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3TJSRE5-part3      -      -      1    260  16.0K  1.51M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRYY0S-part3      -      -      1    259  16.3K  1.51M
--------------------------------------------------  -----  -----  -----  -----  -----  -----

                                                       capacity     operations    bandwidth
pool                                                alloc   free   read  write   read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
anima                                               1005G  3.05T      1    491  16.0K  2.73M
  raidz1                                            1005G  3.05T      1    491  16.0K  2.73M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRY2J8-part3      -      -      0    292  4.80K  1.66M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3TJSRE5-part3      -      -      1    289  8.80K  1.64M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRYY0S-part3      -      -      0    293  2.40K  1.68M
--------------------------------------------------  -----  -----  -----  -----  -----  -----

                                                       capacity     operations    bandwidth
pool                                                alloc   free   read  write   read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
anima                                               1005G  3.05T      0    692      0  4.47M
  raidz1                                            1005G  3.05T      0    692      0  4.47M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRY2J8-part3      -      -      0    309      0  2.74M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3TJSRE5-part3      -      -      0    311      0  2.78M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRYY0S-part3      -      -      0    294      0  2.69M
--------------------------------------------------  -----  -----  -----  -----  -----  -----

                                                       capacity     operations    bandwidth
pool                                                alloc   free   read  write   read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
anima                                               1005G  3.05T      0    570      0  3.39M
  raidz1                                            1005G  3.05T      0    570      0  3.39M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRY2J8-part3      -      -      0    290      0  2.02M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3TJSRE5-part3      -      -      0    280      0  2.00M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRYY0S-part3      -      -      0    305      0  2.10M
--------------------------------------------------  -----  -----  -----  -----  -----  -----

                                                       capacity     operations    bandwidth
pool                                                alloc   free   read  write   read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
anima                                               1005G  3.05T      0    415  16.0K  2.29M
  raidz1                                            1005G  3.05T      0    415  16.0K  2.29M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRY2J8-part3      -      -      0    231  1.60K  1.36M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3TJSRE5-part3      -      -      0    241  6.40K  1.41M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRYY0S-part3      -      -      0    241  8.00K  1.42M
--------------------------------------------------  -----  -----  -----  -----  -----  -----

Note that wait time is basically the same as sync time indicating it's always flushing.

# zdb
anima:
    version: 5000
    name: 'anima'
    state: 0
    txg: 31470873
    pool_guid: 3917646944447414655
    errata: 0
    hostname: 'anima'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 3917646944447414655
        create_txg: 4
        children[0]:
            type: 'raidz'
            id: 0
            guid: 8223485726688364433
            nparity: 1
            metaslab_array: 31
            metaslab_shift: 35
            ashift: 12
            asize: 4463881617408
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 14292022289238774064
                path: '/dev/disk/by-id/ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRY2J8-part3'
                whole_disk: 0
                DTL: 1166291
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 10247335713522123093
                path: '/dev/disk/by-id/ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3TJSRE5-part3'
                whole_disk: 0
                DTL: 937781
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 12517727645203668589
                path: '/dev/disk/by-id/ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6NRYY0S-part3'
                whole_disk: 0
                DTL: 1074125
                create_txg: 4
    features_for_read:

# zfs list -o name,sync
# zfs list -o name,refer,sync|sed  s/tasvideos/container/
NAME                     REFER      SYNC
anima                     181K  disabled
anima/centos-template     490M  disabled
anima/home                524G  disabled
anima/lxc                 181K  disabled
anima/lxc/gameserver     3.58G  disabled
anima/lxc/php            16.7M  disabled
anima/mysql               383M  disabled
anima/container          2.84G  disabled
anima/container/home     6.84G  disabled
anima/container/logs     1.83G  disabled
anima/container/mysql    8.11G  disabled
anima/container/storage  7.82G  disabled
anima/varwww             3.55G  disabled

== Hardware ==
Mobo: Supermicro X8SIE
CPU: Xeon X3430
RAM: 8 GB, ECC enabled
Hard drives: 3x WD Reds, partitioned /boot (mdadm RAID-1), LVM for system (mdadm RAID-5), ZFS in RAID-Z1

== Software ==
SPL: Loaded module v0.6.5-1
ZFS: Loaded module v0.6.4-240_g31f76e2, ZFS pool version 5000, ZFS filesystem version 5
ZFS version is master commit c938adbdbedc44b05eaf862ee0099561c6aafa6b + ABD merge

options zfs zfs_arc_max=5000000000 zfs_arc_meta_limit=1000000000 zfs_dirty_data_max=1000000000 zfs_dirty_data_max_max=2000000000 zfs_dirty_data_max_max_percent=50 zfs_dirty_data_max_percent=40 zfs_dirty_data_sync=300000000

options spl spl_taskq_thread_dynamic=0

Server runs 2 web hosting platforms via LXC with Apache + MySQL, InnoDB has been configured to disable cache flushes for debugging purposes. The container is root ZFS, the host is root ext4.

Even if I freeze the container and shutdown hosting on the host ZFS is still stuck in a commit cycle.

behlendorf · 2015-09-22T22:15:49Z

@DeHackEd are you saying this is a new issue with 0.6.5.1? I noticed similar behavior on my desktop which is running 0.6.5.1 but I was able to determine that I/O was legitimately caused by a particularly busy firefox tab. In my case it wasn't a regression, just behavior I hadn't noticed before.

However, one thing I noticed in the process of investigating was that relatime updates were inadvertently enabled on my system now that temporary mount options are honored. Might this be what's causing your updates?

DeHackEd · 2015-09-22T23:16:00Z

I don't think it's a new issue, but I'm now able to reproduce it pretty much on demand. It's like this system's native load just brings it to the surface.

I've seen this happen before but never could pin down a cause or a reproducer. I don't think it's caused by load - these sites are not particularly busy... Once I saw a system spinning through TXGs at ludicrous speed. No disk IO whatsoever but /proc/spl/kstat/zfs/$POOL/txgs were just going like nobody's business. Sadly I have no proof of that.

Now I have a system that is kept busy, but also seems to be jamming on I/O outright even when I relieve the load. For experimentation I lxc-freeze'd the container and SIGSTOP'd the host nginx process. That should completely bring all IO to a standstill but the contents of txgs over time would make you think it was still working.

DeHackEd · 2015-09-24T21:46:40Z

I seem to have found a way to reproduce this problem on demand. While one web hosting container is running, fire up another. nginx may be the culprit... which is strange. But flipping it on and off several times has consistently reproduced the issue.

During a test run under strace it did no file I/O that was on a ZFS filesystem. A few unusual system calls were executed, including accept4, readv/writev on sockets, the epoll_* family, ioctl(socket, FIONBIO, [1]) and shutdown. Nothing here explains why it happened. I'll be running under Apache for now.

DeHackEd · 2015-09-25T20:00:33Z

Grabbed a debug dump from the system while it was spinning. Download at http://dehacked.2y.net/zfs-spin-debug.txt

I started the trouble shortly after timestamp 1443209435 and it persists for a little while.

DeHackEd · 2015-10-02T19:24:21Z

I was able to reproduce the spinning-txg problem on a system with no I/O whatsoever

# cat /proc/spl/kstat/zfs/aurora2new/txgs 
9 0 0x01 50 5600 274570222798398 9351395002199088
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
7853674  9351394998774304 C     0            0            0            0        0        56650        5843         46243        55814       
7853675  9351394998830954 C     0            0            0            0        0        59590        5676         45771        55814       
7853676  9351394998890544 C     0            0            0            0        0        57250        5673         47542        54931       
7853677  9351394998947794 C     0            0            0            0        0        61030        6403         44139        55614       
7853678  9351394999008824 C     0            0            0            0        0        57806        5907         45529        55736       
7853679  9351394999066630 C     0            0            0            0        0        59590        6034         44783        55846       
7853680  9351394999126220 C     0            0            0            0        0        58274        5553         45916        54930       
7853681  9351394999184494 C     0            0            0            0        0        58943        5763         44897        55272       
7853682  9351394999243437 C     0            0            0            0        0        57443        5424         46082        54802       
7853683  9351394999300880 C     0            0            0            0        0        60644        5723         43330        55239       
7853684  9351394999361524 C     0            0            0            0        0        57526        7357         42959        54363       
7853685  9351394999419050 C     0            0            0            0        0        72296        7050         28625        54434       
7853686  9351394999491346 C     0            0            0            0        0        42653        5987         44246        62312       
7853687  9351394999533999 C     0            0            0            0        0        58747        5893         51297        55783       
7853688  9351394999592746 C     0            0            0            0        0        62700        5900         47750        54453       
7853689  9351394999655446 C     0            0            0            0        0        62120        6920         42446        55634       
7853690  9351394999717566 C     0            0            0            0        0        56380        5897         45820        56546       
7853691  9351394999773946 C     0            0            0            0        0        59210        5663         46467        55024       
7853692  9351394999833156 C     0            0            0            0        0        60133        7297         42987        55794       
7853693  9351394999893289 C     0            0            0            0        0        57687        5933         45243        56008       
7853694  9351394999950976 C     0            0            0            0        0        59173        6034         44933        106928      
7853695  9351395000010149 C     0            0            0            0        0        59314        5860         95877        55912       
7853696  9351395000069463 C     0            0            0            0        0        110257       7126         43691        55630       
7853697  9351395000179720 C     0            0            0            0        0        58850        6233         44577        56412       
7853698  9351395000238570 C     0            0            0            0        0        57937        5780         46746        55859       
7853699  9351395000296507 C     0            0            0            0        0        60400        5823         45374        56050       
7853700  9351395000356907 C     0            0            0            0        0        58856        5557         45763        56699       
7853702  9351395000483429 C     0            0            0            0        0        51360        6210         43634        57357       
7853703  9351395000534789 C     0            0            0            0        0        58474        5960         46196        55961       
7853704  9351395000593263 C     0            0            0            0        0        60586        6857         43893        545268      
7853705  9351395000653849 C     0            0            0            0        0        59337        7503         532187       53169       
7853706  9351395000713186 C     0            0            0            0        0        550193       6467         39475        51553       
7853707  9351395001263379 C     0            0            0            0        0        54520        5907         40227        51938       
7853708  9351395001317899 C     0            0            0            0        0        54354        6933         40042        52286       
7853709  9351395001372253 C     0            0            0            0        0        54293        5527         42666        51062       
7853710  9351395001426546 C     0            0            0            0        0        63022        5944         33497        51528       
7853711  9351395001489568 C     0            0            0            0        0        48044        5760         40366        61483       
7853712  9351395001537612 C     0            0            0            0        0        54990        6990         48811        52955       
7853713  9351395001592602 C     0            0            0            0        0        63466        5717         42833        57574       
7853714  9351395001656068 C     0            0            0            0        0        56270        6134         46131        52875       
7853715  9351395001712338 C     0            0            0            0        0        58240        6030         43950        52893       
7853716  9351395001770578 C     0            0            0            0        0        59080        7194         40088        51701       
7853717  9351395001829658 C     0            0            0            0        0        56210        7270         38882        51744       
7853718  9351395001885868 C     0            0            0            0        0        53534        5770         41637        53282       
7853719  9351395001939402 C     0            0            0            0        0        56293        5830         41790        51109       
7853720  9351395001995695 C     0            0            0            0        0        54670        6600         40807        52394       
7853721  9351395002050365 C     0            0            0            0        0        56073        6854         40462        125774      
7853722  9351395002106438 C     0            0            0            0        0        55150        5787         115443       52760       
7853723  9351395002161588 S     0            0            0            0        0        129560       5897         42047        0           
7853724  9351395002291148 W     0            0            0            0        0        56584        7120         0            0

The system had been run pretty deep into swap before I killed the process responsible. That would seem to have been the trigger. Job was a mysqldump but the tables were being memory buffered... bad idea.

This system was running: 2.6.32-504.23.4.el6.x86_64, ZFS version 65037d9 plus ABD plus a few of my own patches.

behlendorf · 2015-10-05T16:21:30Z

I think we need to do something like extend the dprintf() in txg_sync_thread to include all the values for the wakeup condition to determine exactly why it's not sleeping. Or if it is exactly what's causing it to be woken immediately. The output you posted cleanly shows that it should have just gone back to sleep.

dweeezil · 2015-10-05T17:10:02Z

@DeHackEd Could you please find out what object 28 in the MOS is (zdb -ddd <pool> 28) and then if it's some sort of bpobj, what it might contain (zdb -dddd <pool> 1 and see if there's a "= 28"). If by some chance it's the deferred free list, then zdb -i -dddd <pool>might prove useful.

The reason I find object 28 interesting is because 1 block is being freed from it fairly repeatedly in your debugging output.

DeHackEd · 2015-10-05T18:37:59Z

It's sync_bplist = 28

dweeezil · 2015-10-05T19:12:07Z

@DeHackEd That's what I suspected; it's the list of deferred frees, which are processed in the sync task. I find this particularly interesting that we also have #3870 (involving freeing space). I think the next best step as @behlendorf suggested would be to add additional dprintf telemetry to txg_sync_thread when it doesn't sleep.

behlendorf · 2015-10-05T19:22:22Z

Just looking at the changes between 0.6.4 and 0.6.5 commit 4bda3bd is the most relevant but it's not immediately clear how it would cause either of these issues. Still it would be an easy test to try reverting it along with adding some debugging to see what impact it has.

DeHackEd · 2015-10-07T00:08:57Z

I think I found an alternative reproducer. I use a tool called Torrus as a mass RRDtool based grapher. I noticed my desktop doing a single cycle of the spin every ~5 minutes, so I recompiled rrdtool with --disable-mmap and it seems to have improved.

I haven't tested with your above suggestions, will give it a spin tomorrow. With a reproducible test case I can migrate it into a VM and try it there.

dweeezil · 2015-10-07T16:39:18Z

I was not able to reproduce this problem. My suspicion is that something's wrong with the processing of the deferred free list and that the sync task is staying awake trying to process it. Normally, there's a handful of entries left in the list which are typically picked up in later sync passes but I wonder if it's trying too hard, for some reason, to process the list in this case. It might be interesting to fiddle with the setting of zfs_sync_pass_deferred_free and see if it makes any difference.

DeHackEd · 2016-07-13T21:12:47Z

A random conversation in IRC made me clue in to what might be the issue. There's a quota on the filesystem which is nearly full, so ZFS is aggressive about flushing out data to prevent the quota from being overflowed. Unlike user quotas where ZFS doesn't care about going over by a bit, dataset quotas are very strictly enforced.

Closing and grabbing brown paper bag.

behlendorf added Bug - Minor labels Sep 22, 2015

behlendorf added this to the 0.7.0 milestone Sep 22, 2015

behlendorf removed the Bug - Point Release label Sep 23, 2015

behlendorf added the Type: Performance Performance improvement or performance problem label Oct 5, 2015

behlendorf modified the milestones: 0.8.0, 0.7.0 Mar 26, 2016

DeHackEd closed this as completed Jul 13, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TXG syncing non-stop #3811

TXG syncing non-stop #3811

DeHackEd commented Sep 21, 2015

behlendorf commented Sep 22, 2015

DeHackEd commented Sep 22, 2015

DeHackEd commented Sep 24, 2015

DeHackEd commented Sep 25, 2015

DeHackEd commented Oct 2, 2015

behlendorf commented Oct 5, 2015

dweeezil commented Oct 5, 2015

DeHackEd commented Oct 5, 2015

dweeezil commented Oct 5, 2015

behlendorf commented Oct 5, 2015

DeHackEd commented Oct 7, 2015

dweeezil commented Oct 7, 2015

DeHackEd commented Jul 13, 2016

TXG syncing non-stop #3811

TXG syncing non-stop #3811

Comments

DeHackEd commented Sep 21, 2015

behlendorf commented Sep 22, 2015

DeHackEd commented Sep 22, 2015

DeHackEd commented Sep 24, 2015

DeHackEd commented Sep 25, 2015

DeHackEd commented Oct 2, 2015

behlendorf commented Oct 5, 2015

dweeezil commented Oct 5, 2015

DeHackEd commented Oct 5, 2015

dweeezil commented Oct 5, 2015

behlendorf commented Oct 5, 2015

DeHackEd commented Oct 7, 2015

dweeezil commented Oct 7, 2015

DeHackEd commented Jul 13, 2016