
" zfs: allocating allocated segment" with logbias=throughput #541

Closed
maxximino opened this issue Jan 23, 2012 · 13 comments

@maxximino
Contributor

Steps to reproduce:

  1. create a dataset with compression=gzip and logbias=throughput
  2. debootstrap wheezy into that dataset
  3. chroot into it and open aptitude
  4. when it tries to rebuild its cache (/var/cache/apt/*bin), it hangs and the kernel log shows:
[  287.074252] zfs: allocating allocated segment(offset=816155484672 size=16896)
[  287.074312] SPLError: 1966:0:(spl-err.c:48:vpanic()) SPL PANIC
[  287.074366] SPL: Showing stack for process 1966
[  287.074368] Pid: 1966, comm: z_fr_iss/4 Tainted: P            3.1.10 #1
[  287.074370] Call Trace:
[  287.074386]  [] spl_debug_dumpstack+0x27/0x50 [spl]
[  287.074390]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074396]  [] ? spl_debug_bug+0x1/0xe0 [spl]
[  287.074398]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074403]  [] ? stp_print_flush+0xc3/0x1e0 [stap_9988]
[  287.074410]  [] ? vpanic+0x1/0x60 [spl]
[  287.074413]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074415]  [] ? opt_pre_handler+0x57/0x90
[  287.074421]  [] ? vcmn_err+0x1/0x60 [spl]
[  287.074423]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074425]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074427]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074430]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074432]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074434]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074436]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074439]  [] ? try_to_wake_up+0x280/0x280
[  287.074444]  [] ? task_alloc+0x160/0x160 [spl]
[  287.074447]  [] kthread+0x96/0xa0
[  287.074449]  [] kernel_thread_helper+0x4/0x10
[  287.074451]  [] ? kthread_worker_fn+0x190/0x190
[  287.074453]  [] ? gs_change+0xb/0xb
[  287.074590] SPL: Dumping log to /tmp/spl-log.1327200972.1966

and I/O on every pool halts.
After reboot, zdb -b reports leaked space.

Resetting logbias to latency works around the problem.
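
For reference, the dataset setup from step 1 and the workaround look roughly like this (the pool/dataset name tank/wheezy is just a placeholder):

 # create the affected dataset
 zfs create -o compression=gzip -o logbias=throughput tank/wheezy

 # workaround: revert to the default log bias
 zfs set logbias=latency tank/wheezy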

zfs version: ff998d8

I've been able to reproduce the issue on two different machines, on 4 different pools.

I think it would be better to temporarily stop honouring the logbias=throughput setting (until the problem is properly fixed), given that the integrity of the space maps is at risk and the machine needs to be rebooted afterwards.

@behlendorf
Contributor

Indeed, thanks for reporting this. We'll either disable logbias=throughput for now, or just fix it if the issue ends up being pretty straightforward.

@dajhorn
Contributor

dajhorn commented Mar 17, 2012

This bug recently began happening for me on a computer running Ubuntu 11.10 and ZoL 0.6.0-rc7. Any action that loads /var/cache/apt on the affected computer causes a panic, even when the system is otherwise perfectly idle.

The root pool is more than a year old and is usually above 90% utilization. The panic doesn't instantly happen on a new pool like it does on the older installation.

Running zfs set logbias=latency did not resolve the problem for me, but running zfs set compression=off did.

[20272.805081] zfs: allocating allocated segment(offset=3056241704960 size=119808)
[20272.805174] SPLError: 508:0:(spl-err.c:48:vpanic()) SPL PANIC
[20272.805260] SPL: Showing stack for process 508
[20272.805266] Pid: 508, comm: z_fr_iss/0 Tainted: P            3.0.0-16-generic #29-Ubuntu
[20272.805269] Call Trace:
[20272.805300]  [<ffffffffa00af4d7>] spl_debug_dumpstack+0x27/0x40 [spl]
[20272.805340]  [<ffffffffa00b07e2>] spl_debug_bug+0x82/0xe0 [spl]
[20272.805353]  [<ffffffffa00bb575>] vpanic+0x65/0x80 [spl]
[20272.805374]  [<ffffffffa00b3e23>] ? kmem_alloc_debug+0x263/0x350 [spl]
[20272.805384]  [<ffffffffa00b3e23>] ? kmem_alloc_debug+0x263/0x350 [spl]
[20272.805392]  [<ffffffff811548ed>] ? __kmalloc+0x13d/0x190
[20272.805401]  [<ffffffffa00b3e23>] ? kmem_alloc_debug+0x263/0x350 [spl]
[20272.805413]  [<ffffffffa00bb71f>] vcmn_err+0x4f/0x60 [spl]
[20272.805467]  [<ffffffffa01c2cf2>] zfs_panic_recover+0x52/0x60 [zfs]
[20272.805503]  [<ffffffffa01a7ec4>] space_map_add+0x254/0x3c0 [zfs]
[20272.805510]  [<ffffffff815f2dae>] ? _raw_spin_lock+0xe/0x20
[20272.805520]  [<ffffffffa00b3840>] ? spl_cache_flush+0x270/0x320 [spl]
[20272.805551]  [<ffffffffa0191833>] metaslab_free_dva+0x123/0x1d0 [zfs]
[20272.805583]  [<ffffffffa019335c>] metaslab_free+0x8c/0xc0 [zfs]
[20272.805616]  [<ffffffffa01e762c>] zio_dva_free+0x1c/0x30 [zfs]
[20272.805649]  [<ffffffffa01e85cf>] zio_execute+0x9f/0xf0 [zfs]
[20272.805660]  [<ffffffffa00b811b>] taskq_thread+0x24b/0x5a0 [spl]
[20272.805666]  [<ffffffff8104e6f9>] ? finish_task_switch+0x49/0xf0
[20272.805670]  [<ffffffff81057450>] ? try_to_wake_up+0x200/0x200
[20272.805681]  [<ffffffffa00b7ed0>] ? task_done+0x140/0x140 [spl]
[20272.805686]  [<ffffffff8108135c>] kthread+0x8c/0xa0
[20272.805691]  [<ffffffff815fc324>] kernel_thread_helper+0x4/0x10
[20272.805695]  [<ffffffff810812d0>] ? flush_kthread_worker+0xa0/0xa0
[20272.805699]  [<ffffffff815fc320>] ? gs_change+0x13/0x13
[20272.805766] SPL: Dumping log to /tmp/spl-log.1332013476.508

@behlendorf
Contributor

Google turned up quite a few very similar issues for both FreeBSD and Illumos, but it seems at least one issue is still lurking.

http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2009-02/msg00346.html
https://www.illumos.org/issues/1175
https://www.illumos.org/issues/883 (This was fixed in 0.6.0-rc6)

@behlendorf
Contributor

I've seen a variety of bug reports now which suggest that there is a lingering issue (in all the ZFS implementations) where the space map can be updated incorrectly when a file system is filled to capacity.

@dajhorn
Contributor

dajhorn commented Aug 11, 2012

The zfs: allocating allocated segment error can be temporarily ignored by setting the zfs_recover option like this:

 # modprobe zfs zfs_recover=1
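
To make the setting persist across module reloads, the equivalent modprobe configuration (standard modprobe syntax; the file name is just a convention) would be:

 # /etc/modprobe.d/zfs.conf
 options zfs zfs_recover=1

As noted above, this only lets the error be ignored temporarily; it doesn't fix the underlying problem.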

@maxximino
Contributor Author

I'm no longer able to reproduce this issue on a file-backed zpool with a recent ZFS version.
However, my kernel log buffer was spammed with stack traces that look like other places needing the KM_SLEEP to KM_PUSHPAGE change.
I don't see the "Fixing allocation..." message, presumably because it's written with SDEBUG_LIMIT.
I've added "write the patch" to my TODO list.
Kernel log: http://pastebin.com/3A8nbXDk
If someone else confirms that it's OK now, I think this can be closed.

@maxximino
Contributor Author

Never mind, I've reproduced the issue. It's still a problem.

gunnarbeutner added a commit to gunnarbeutner/zfs that referenced this issue Oct 30, 2012
We need to wait for pending I/O to finish before trying to clear the
dirty record's override state.

This problem occurs when multiple threads call dbuf_dirty() for the
same dbuf object while it is currently being synced via dmu_sync()
but hasn't reached dmu_sync_done() yet:

  zfs: allocating allocated segment(offset=323629568 size=78336)
  SPLError: 2283:0:(spl-err.c:48:vpanic()) SPL PANIC

Issue openzfs#541
@gunnarbeutner
Contributor

I've been able to reproduce this issue reliably using the following program (which pretty much does what aptitude does):

#include <unistd.h>
#include <stdlib.h>     /* rand() */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(void)
{
    int fd;
    size_t i;
    char *ptr;
    size_t len = 20 * 1024 * 1024;

    /* Create a fresh 20 MB sparse file on the affected dataset. */
    unlink("/tank/test.tmp");
    fd = open("/tank/test.tmp", O_RDWR|O_CREAT|O_EXCL, 0666);

    lseek(fd, len, SEEK_SET);
    write(fd, "", 1);

    /* Map it shared and dirty every page. */
    ptr = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

    for (i = 0; i < len; i++)
        ptr[i] = rand() % 0xff;

    /* Synchronously flush only the first bytes while the rest of the
     * mapping is still dirty. */
    msync(ptr, 1024, MS_SYNC);

    return 0;
}
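
Assuming the source is saved as test.c and /tank is a dataset on an affected pool (that path is hard-coded above), it can be built and run with:

 gcc -o test test.c
 ./test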

And here's the stacktrace I get:

[   91.767949] zfs: allocating allocated segment(offset=342095872 size=86528)
[   91.767965] SPLError: 1396:0:(spl-err.c:48:vpanic()) SPL PANIC
[   91.767977] SPL: Showing stack for process 1396
[   91.767990] Pid: 1396, comm: z_fr_iss/1 Tainted: P           O 3.2.0-32-generic #51-Ubuntu
[   91.767997] Call Trace:
[   91.768029]  [<ffffffffa002d4d7>] spl_debug_dumpstack+0x27/0x40 [spl]
[   91.768054]  [<ffffffffa002e7c2>] spl_debug_bug+0x82/0xe0 [spl]
[   91.768082]  [<ffffffffa003a3b5>] vpanic+0x65/0x80 [spl]
[   91.768103]  [<ffffffff8163f425>] ? xen_spin_unlock_slow+0x3d/0x63
[   91.768116]  [<ffffffff81011ce8>] ? xen_spin_unlock+0x18/0x30
[   91.768135]  [<ffffffff810675be>] ? console_trylock_for_printk+0x9e/0xb0
[   91.768160]  [<ffffffff81067944>] ? vprintk+0x1f4/0x4a0
[   91.768183]  [<ffffffffa003a55f>] vcmn_err+0x4f/0x60 [spl]
[   91.768238]  [<ffffffffa014a1c2>] zfs_panic_recover+0x52/0x60 [zfs]
[   91.768264]  [<ffffffffa012d8a4>] space_map_add+0x264/0x3d0 [zfs]
[   91.768268]  [<ffffffff810674cc>] ? console_trylock+0x1c/0x70
[   91.768274]  [<ffffffff810675be>] ? console_trylock_for_printk+0x9e/0xb0
[   91.768304]  [<ffffffffa011698b>] metaslab_free_dva+0x12b/0x1d0 [zfs]
[   91.768329]  [<ffffffffa011864c>] metaslab_free+0x8c/0xc0 [zfs]
[   91.768354]  [<ffffffffa016fd39>] zio_dva_free+0x59/0x80 [zfs]
[   91.768378]  [<ffffffffa0170cdf>] zio_execute+0x9f/0xf0 [zfs]
[   91.768386]  [<ffffffffa0036c4b>] taskq_thread+0x23b/0x590 [spl]
[   91.768392]  [<ffffffff81056153>] ? finish_task_switch+0x53/0xf0
[   91.768414]  [<ffffffff8105f680>] ? try_to_wake_up+0x200/0x200
[   91.768427]  [<ffffffffa0036a10>] ? task_done+0x140/0x140 [spl]
[   91.768436]  [<ffffffff8108a0dc>] kthread+0x8c/0xa0
[   91.768463]  [<ffffffff816655b4>] kernel_thread_helper+0x4/0x10
[   91.768472]  [<ffffffff81663663>] ? int_ret_from_sys_call+0x7/0x1b
[   91.768485]  [<ffffffff8165b33c>] ? retint_restore_args+0x5/0x6
[   91.768496]  [<ffffffff816655b0>] ? gs_change+0x13/0x13

By adding some printk()s I've been able to trace this problem back to two space_map_add() calls:

[   91.767893] <1395> free:zio_dva_free(ffff8800790b8fe0): spa_sync<2>
[   91.767905] <1395> space_map_add(ffff880078d5c310, 342095872, 86528)

[   91.767920] <1396> free:zio_dva_free(ffff8800790b93c8): spa_sync<2>
[   91.767939] <1396> space_map_add(ffff880078d5c310, 342095872, 86528)

So apparently we're trying to add the same byte range to the free space map twice.

Those two zios come from spa_free_sync_cb():

[   91.751977] <1409> spa_free_sync_cb(): dva: off=342095872, size=86528
[   91.751997] <1409> spa_free_sync_cb(): dva: off=342095872, size=86528

And even more tracing:

[   91.277151] <1423> dbuf_dirty()
[   91.277155] <1423> dbuf_unoverride()
[   91.277158] <1423> zio_free(): dva: off=342095872, size=86528

[   91.284462] <1460> dbuf_dirty()
[   91.284466] <1460> dbuf_unoverride()
[   91.284469] <1460> zio_free(): dva: off=342095872, size=86528

So something in dbuf_dirty()/dbuf_unoverride() is trying to free the same DVA twice. At that point I ran out of ideas and turned on --enable-debug, which made things a lot easier:

[  115.828210] SPLError: 1797:0:(dbuf.c:807:dbuf_unoverride()) ASSERTION(dr->dt.dl.dr_override_state != DR_IN_DMU_SYNC) failed
[  115.828229] SPLError: 1797:0:(dbuf.c:807:dbuf_unoverride()) SPL PANIC
[  115.828235] SPL: Showing stack for process 1797
[  115.828240] Pid: 1797, comm: flush-zfs-1 Tainted: P           O 3.2.0-32-generic #51-Ubuntu
[  115.828244] Call Trace:
[  115.828261]  [<ffffffffa002d4d7>] spl_debug_dumpstack+0x27/0x40 [spl]
[  115.828269]  [<ffffffffa002e7c2>] spl_debug_bug+0x82/0xe0 [spl]
[  115.828314]  [<ffffffffa00e940a>] dbuf_unoverride+0x13a/0x240 [zfs]
[  115.828343]  [<ffffffffa00f220d>] dbuf_dirty+0x159d/0x15d0 [zfs]
[  115.828367]  [<ffffffffa00ee9a3>] ? dbuf_read+0x8f3/0xf20 [zfs]
[  115.828394]  [<ffffffffa00fb5ad>] ? dmu_buf_hold_array_by_dnode+0x2ad/0x830 [zfs]
[  115.828425]  [<ffffffffa01148dd>] ? dnode_rele+0x6d/0x150 [zfs]
[  115.828452]  [<ffffffffa00f3648>] dmu_buf_will_dirty+0xd8/0x1b0 [zfs]
[  115.828480]  [<ffffffffa00fbe55>] dmu_write+0x85/0x250 [zfs]
[  115.828533]  [<ffffffffa01b84ae>] zfs_putpage+0x1ee/0x380 [zfs]
[  115.828546]  [<ffffffff813100c0>] ? prio_tree_next+0xb0/0x1a0
[  115.828613]  [<ffffffffa01d51cc>] zpl_putpage+0x6c/0x160 [zfs]
[  115.828637]  [<ffffffff811216cd>] write_cache_pages+0x20d/0x460
[  115.828677]  [<ffffffffa01d5160>] ? zpl_readpage+0xc0/0xc0 [zfs]
[  115.828728]  [<ffffffffa01d5098>] zpl_writepages+0x18/0x20 [zfs]
[  115.828741]  [<ffffffff81122831>] do_writepages+0x21/0x40
[  115.828754]  [<ffffffff811a1440>] writeback_single_inode+0x180/0x430
[  115.828790]  [<ffffffff811a1b06>] writeback_sb_inodes+0x1b6/0x270
[  115.828827]  [<ffffffff811a1c5e>] __writeback_inodes_wb+0x9e/0xd0
[  115.828855]  [<ffffffff811a1f0b>] wb_writeback+0x27b/0x330
[  115.828871]  [<ffffffff810126ef>] ? __switch_to+0xff/0x360
[  115.828889]  [<ffffffff811935a2>] ? get_nr_dirty_inodes+0x52/0x80
[  115.828913]  [<ffffffff811a205f>] wb_check_old_data_flush+0x9f/0xb0
[  115.828930]  [<ffffffff811a2f51>] wb_do_writeback+0x151/0x1d0
[  115.828941]  [<ffffffff8100330e>] ? xen_end_context_switch+0x1e/0x30
[  115.828955]  [<ffffffff810767a0>] ? usleep_range+0x50/0x50
[  115.828975]  [<ffffffff811a3053>] bdi_writeback_thread+0x83/0x2a0
[  115.828992]  [<ffffffff8165b04e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[  115.829016]  [<ffffffff811a2fd0>] ? wb_do_writeback+0x1d0/0x1d0
[  115.829029]  [<ffffffff8108a0dc>] kthread+0x8c/0xa0
[  115.829046]  [<ffffffff816655b4>] kernel_thread_helper+0x4/0x10
[  115.829058]  [<ffffffff81663663>] ? int_ret_from_sys_call+0x7/0x1b
[  115.829078]  [<ffffffff8165b33c>] ? retint_restore_args+0x5/0x6
[  115.829108]  [<ffffffff816655b0>] ? gs_change+0x13/0x13

It looks like dbuf_dirty() is being called for a dbuf object that is currently being synced to disk by dmu_sync() but hasn't quite finished yet (i.e. hasn't reached dmu_sync_done()).

The patch at 86d68d3 appears to fix the problem for me. However I'm not entirely sure whether dropping db->db_mtx at that spot is safe. I think it is but I might need a second opinion on that. :)

@behlendorf
Contributor

@gunnarbeutner Nice job running this one to ground! But I think the right fix is to handle this up in zfs_putpage() by preventing the race in the first place. This is how the issue is handled in Illumos, and it was only a problem on Linux because I accidentally dropped these locks when rewriting the function for the Linux VFS.

Can you please review and test the patch d1532291ad77b6658d01b7df76030e8b8c8d51af? Without the patch applied I was able to hit the issue with your test case; with it applied I can't reproduce the crash anymore.

@gunnarbeutner
Contributor

The patch looks good. I've been running my test program in a loop for about an hour now and haven't seen any crashes. aptitude appears to be working as well.

@behlendorf
Contributor

In my automated testing I've noticed the patch hurts mmap'ed I/O performance more than I'd like. We'll need to think about this a little more before merging it; we may want to move the locking up to the broader zpl_writepages() function.

@behlendorf
Contributor

OK, I was able to investigate this further. The patch actually doesn't significantly impact performance, but what it does do is expose an unrelated condition variable issue which results in a hung task. Before I merge this change I want to run down this second issue.

behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 10, 2012
Gunnar Beutner did all the hard work on this one by correctly
identifying that this issue is a race between dmu_sync() and
dbuf_dirty().

Now in all cases the caller is responsible for preventing this
race by making sure the zfs_range_lock() is held when dirtying
a buffer which may be referenced in a log record.  The mmap
case which relies on zfs_putpage() was not taking the range
lock.  This code was accidentally dropped when the function
was rewritten for the Linux VFS.

This patch adds the required range locking to zfs_putpage().

It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
aren't strictly required due to the VFS holding a reference.
However, this makes the code more consistent with the upstream
code and there's no harm in being extra careful here.

Original-patch-by: Gunnar Beutner <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#541
@behlendorf
Contributor

It turns out the condition variable hang was introduced by the previous version of this patch, which accidentally created a lock inversion between the ZFS range lock and the VM's page lock. The updated patch which I just merged doesn't contain this flaw because the range lock is now acquired after the page lock is dropped.
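
For anyone less familiar with the term, here's a generic illustration of a lock inversion using pthreads; it is not the ZFS code, just the AB/BA ordering pattern the earlier version of the patch accidentally created between the two locks, and which the merged version avoids by only taking the range lock after the page lock has been dropped:

/* Build with: gcc -pthread inversion.c */
#include <pthread.h>

static pthread_mutex_t page_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread A takes page_lock, then range_lock. */
static void *thread_a(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&page_lock);
    pthread_mutex_lock(&range_lock);
    pthread_mutex_unlock(&range_lock);
    pthread_mutex_unlock(&page_lock);
    return NULL;
}

/* Thread B takes the same locks in the opposite (inverted) order. If A
 * holds page_lock while B holds range_lock, each waits forever on the
 * lock the other holds -- a deadlock. The fix is to make every path
 * acquire the locks in one agreed order. */
static void *thread_b(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&range_lock);
    pthread_mutex_lock(&page_lock);
    pthread_mutex_unlock(&page_lock);
    pthread_mutex_unlock(&range_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}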

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013