
" zfs: allocating allocated segment" with logbias=throughput #541

Closed
maxximino opened this issue Jan 23, 2012 · 13 comments

@maxximino
Contributor

Steps to reproduce:

  1. create a dataset with compression=gzip and logbias=throughput
  2. debootstrap wheezy into that dataset
  3. chroot into it and open aptitude
  4. when it tries to rebuild its cache (/var/cache/apt/*bin), it hangs and the kernel log shows:
[  287.074252] zfs: allocating allocated segment(offset=816155484672 size=16896)
[  287.074312] SPLError: 1966:0:(spl-err.c:48:vpanic()) SPL PANIC
[  287.074366] SPL: Showing stack for process 1966
[  287.074368] Pid: 1966, comm: z_fr_iss/4 Tainted: P            3.1.10 #1
[  287.074370] Call Trace:
[  287.074386]  [] spl_debug_dumpstack+0x27/0x50 [spl]
[  287.074390]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074396]  [] ? spl_debug_bug+0x1/0xe0 [spl]
[  287.074398]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074403]  [] ? stp_print_flush+0xc3/0x1e0 [stap_9988]
[  287.074410]  [] ? vpanic+0x1/0x60 [spl]
[  287.074413]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074415]  [] ? opt_pre_handler+0x57/0x90
[  287.074421]  [] ? vcmn_err+0x1/0x60 [spl]
[  287.074423]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074425]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074427]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074430]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074432]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074434]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074436]  [] kretprobe_trampoline_holder+0x9/0x9
[  287.074439]  [] ? try_to_wake_up+0x280/0x280
[  287.074444]  [] ? task_alloc+0x160/0x160 [spl]
[  287.074447]  [] kthread+0x96/0xa0
[  287.074449]  [] kernel_thread_helper+0x4/0x10
[  287.074451]  [] ? kthread_worker_fn+0x190/0x190
[  287.074453]  [] ? gs_change+0xb/0xb
[  287.074590] SPL: Dumping log to /tmp/spl-log.1327200972.1966

and I/O on every pool halts.
After reboot, zdb -b reports leaked space.

Resetting logbias to latency works around the problem.
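
For reference, the dataset setup from step 1 and the workaround look roughly like this (the pool/dataset name tank/wheezy is just a placeholder):

 # create the affected dataset
 zfs create -o compression=gzip -o logbias=throughput tank/wheezy

 # workaround: revert to the default log bias
 zfs set logbias=latency tank/wheezy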

zfs version: ff998d8

I've been able to reproduce the issue on two different machines, on 4 different pools.

I think it would be better to temporarily stop honouring the logbias=throughput setting (until the problem is properly fixed), given that the integrity of the space maps is at risk and the machine needs to be rebooted afterwards.

@behlendorf
Contributor

Indeed, thanks for reporting this. We'll either disable logbias=throughput for now, or just fix it if the issue ends up being pretty straightforward.

@dajhorn
Contributor

dajhorn commented Mar 17, 2012

This bug recently began happening for me on a computer running Ubuntu 11.10 and ZoL 0.6.0-rc7. Any action that loads /var/cache/apt on the affected computer causes a panic, even when the system is otherwise perfectly idle.

The root pool is more than a year old and is usually above 90% utilization. The panic doesn't instantly happen on a new pool like it does on the older installation.

Running zfs set logbias=latency did not resolve the problem for me, but running zfs set compression=off did.

[20272.805081] zfs: allocating allocated segment(offset=3056241704960 size=119808)
[20272.805174] SPLError: 508:0:(spl-err.c:48:vpanic()) SPL PANIC
[20272.805260] SPL: Showing stack for process 508
[20272.805266] Pid: 508, comm: z_fr_iss/0 Tainted: P            3.0.0-16-generic #29-Ubuntu
[20272.805269] Call Trace:
[20272.805300]  [<ffffffffa00af4d7>] spl_debug_dumpstack+0x27/0x40 [spl]
[20272.805340]  [<ffffffffa00b07e2>] spl_debug_bug+0x82/0xe0 [spl]
[20272.805353]  [<ffffffffa00bb575>] vpanic+0x65/0x80 [spl]
[20272.805374]  [<ffffffffa00b3e23>] ? kmem_alloc_debug+0x263/0x350 [spl]
[20272.805384]  [<ffffffffa00b3e23>] ? kmem_alloc_debug+0x263/0x350 [spl]
[20272.805392]  [<ffffffff811548ed>] ? __kmalloc+0x13d/0x190
[20272.805401]  [<ffffffffa00b3e23>] ? kmem_alloc_debug+0x263/0x350 [spl]
[20272.805413]  [<ffffffffa00bb71f>] vcmn_err+0x4f/0x60 [spl]
[20272.805467]  [<ffffffffa01c2cf2>] zfs_panic_recover+0x52/0x60 [zfs]
[20272.805503]  [<ffffffffa01a7ec4>] space_map_add+0x254/0x3c0 [zfs]
[20272.805510]  [<ffffffff815f2dae>] ? _raw_spin_lock+0xe/0x20
[20272.805520]  [<ffffffffa00b3840>] ? spl_cache_flush+0x270/0x320 [spl]
[20272.805551]  [<ffffffffa0191833>] metaslab_free_dva+0x123/0x1d0 [zfs]
[20272.805583]  [<ffffffffa019335c>] metaslab_free+0x8c/0xc0 [zfs]
[20272.805616]  [<ffffffffa01e762c>] zio_dva_free+0x1c/0x30 [zfs]
[20272.805649]  [<ffffffffa01e85cf>] zio_execute+0x9f/0xf0 [zfs]
[20272.805660]  [<ffffffffa00b811b>] taskq_thread+0x24b/0x5a0 [spl]
[20272.805666]  [<ffffffff8104e6f9>] ? finish_task_switch+0x49/0xf0
[20272.805670]  [<ffffffff81057450>] ? try_to_wake_up+0x200/0x200
[20272.805681]  [<ffffffffa00b7ed0>] ? task_done+0x140/0x140 [spl]
[20272.805686]  [<ffffffff8108135c>] kthread+0x8c/0xa0
[20272.805691]  [<ffffffff815fc324>] kernel_thread_helper+0x4/0x10
[20272.805695]  [<ffffffff810812d0>] ? flush_kthread_worker+0xa0/0xa0
[20272.805699]  [<ffffffff815fc320>] ? gs_change+0x13/0x13
[20272.805766] SPL: Dumping log to /tmp/spl-log.1332013476.508

@behlendorf
Contributor

Google turned up quite a few very similar issues for both FreeBSD and Illumos, but it seems at least one issue is still lurking.

http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2009-02/msg00346.html
https://www.illumos.org/issues/1175
https://www.illumos.org/issues/883 (This was fixed in 0.6.0-rc6)

@behlendorf
Contributor

I've seen a variety of bug reports now which suggest that there is a lingering issue (in all the ZFS implementations) where the space map can be updated incorrectly when a file system is filled to capacity.

@dajhorn
Contributor

dajhorn commented Aug 11, 2012

The zfs: allocating allocated segment error can be temporarily ignored by setting the zfs_recover option like this:

 # modprobe zfs zfs_recover=1
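
To make the setting persist across module reloads, the equivalent modprobe configuration (standard modprobe syntax; the file name is just a convention) would be:

 # /etc/modprobe.d/zfs.conf
 options zfs zfs_recover=1

As noted above, this only lets the error be ignored temporarily; it doesn't fix the underlying problem.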

@maxximino
Contributor Author

I'm no longer able to reproduce this issue on a file-backed zpool with a recent ZFS version.
However, my kernel log buffer was spammed with stack traces that look like other places needing the KM_SLEEP to KM_PUSHPAGE change.
I don't see the "Fixing allocation..." message, presumably because it's written with SDEBUG_LIMIT.
I've added "write the patch" to my TODO list.
Kernel log: http://pastebin.com/3A8nbXDk
If someone else confirms that it's OK now, I think this can be closed.

@maxximino
Contributor Author

Never mind, I've reproduced the issue. It's still a problem.

gunnarbeutner added a commit to gunnarbeutner/zfs that referenced this issue Oct 30, 2012
We need to wait for pending I/O to finish before trying to clear the
dirty record's override state.

This problem occurs when multiple threads call dbuf_dirty() for the
same dbuf object while it is currently being synced via dmu_sync()
but hasn't reached dmu_sync_done() yet:

  zfs: allocating allocated segment(offset=323629568 size=78336)
  SPLError: 2283:0:(spl-err.c:48:vpanic()) SPL PANIC

Issue openzfs#541
@gunnarbeutner
Contributor

I've been able to reproduce this issue reliably using the following program (which pretty much does what aptitude does):

#include <unistd.h>
#include <stdlib.h>     /* rand() */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(void)
{
    int fd;
    size_t i;
    char *ptr;
    size_t len = 20 * 1024 * 1024;

    /* Create a fresh 20 MB sparse file on the affected dataset. */
    unlink("/tank/test.tmp");
    fd = open("/tank/test.tmp", O_RDWR|O_CREAT|O_EXCL, 0666);

    lseek(fd, len, SEEK_SET);
    write(fd, "", 1);

    /* Map it shared and dirty every page. */
    ptr = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

    for (i = 0; i < len; i++)
        ptr[i] = rand() % 0xff;

    /* Synchronously flush only the first bytes while the rest of the
     * mapping is still dirty. */
    msync(ptr, 1024, MS_SYNC);

    return 0;
}
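
Assuming the source is saved as test.c and /tank is a dataset on an affected pool (that path is hard-coded above), it can be built and run with:

 gcc -o test test.c
 ./test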

And here's the stacktrace I get:

[   91.767949] zfs: allocating allocated segment(offset=342095872 size=86528)
[   91.767965] SPLError: 1396:0:(spl-err.c:48:vpanic()) SPL PANIC
[   91.767977] SPL: Showing stack for process 1396
[   91.767990] Pid: 1396, comm: z_fr_iss/1 Tainted: P           O 3.2.0-32-generic #51-Ubuntu
[   91.767997] Call Trace:
[   91.768029]  [<ffffffffa002d4d7>] spl_debug_dumpstack+0x27/0x40 [spl]
[   91.768054]  [<ffffffffa002e7c2>] spl_debug_bug+0x82/0xe0 [spl]
[   91.768082]  [<ffffffffa003a3b5>] vpanic+0x65/0x80 [spl]
[   91.768103]  [<ffffffff8163f425>] ? xen_spin_unlock_slow+0x3d/0x63
[   91.768116]  [<ffffffff81011ce8>] ? xen_spin_unlock+0x18/0x30
[   91.768135]  [<ffffffff810675be>] ? console_trylock_for_printk+0x9e/0xb0
[   91.768160]  [<ffffffff81067944>] ? vprintk+0x1f4/0x4a0
[   91.768183]  [<ffffffffa003a55f>] vcmn_err+0x4f/0x60 [spl]
[   91.768238]  [<ffffffffa014a1c2>] zfs_panic_recover+0x52/0x60 [zfs]
[   91.768264]  [<ffffffffa012d8a4>] space_map_add+0x264/0x3d0 [zfs]
[   91.768268]  [<ffffffff810674cc>] ? console_trylock+0x1c/0x70
[   91.768274]  [<ffffffff810675be>] ? console_trylock_for_printk+0x9e/0xb0
[   91.768304]  [<ffffffffa011698b>] metaslab_free_dva+0x12b/0x1d0 [zfs]
[   91.768329]  [<ffffffffa011864c>] metaslab_free+0x8c/0xc0 [zfs]
[   91.768354]  [<ffffffffa016fd39>] zio_dva_free+0x59/0x80 [zfs]
[   91.768378]  [<ffffffffa0170cdf>] zio_execute+0x9f/0xf0 [zfs]
[   91.768386]  [<ffffffffa0036c4b>] taskq_thread+0x23b/0x590 [spl]
[   91.768392]  [<ffffffff81056153>] ? finish_task_switch+0x53/0xf0
[   91.768414]  [<ffffffff8105f680>] ? try_to_wake_up+0x200/0x200
[   91.768427]  [<ffffffffa0036a10>] ? task_done+0x140/0x140 [spl]
[   91.768436]  [<ffffffff8108a0dc>] kthread+0x8c/0xa0
[   91.768463]  [<ffffffff816655b4>] kernel_thread_helper+0x4/0x10
[   91.768472]  [<ffffffff81663663>] ? int_ret_from_sys_call+0x7/0x1b
[   91.768485]  [<ffffffff8165b33c>] ? retint_restore_args+0x5/0x6
[   91.768496]  [<ffffffff816655b0>] ? gs_change+0x13/0x13

By adding some printk()s I've been able to trace this problem back to two space_map_add() calls:

[   91.767893] <1395> free:zio_dva_free(ffff8800790b8fe0): spa_sync<2>
[   91.767905] <1395> space_map_add(ffff880078d5c310, 342095872, 86528)

[   91.767920] <1396> free:zio_dva_free(ffff8800790b93c8): spa_sync<2>
[   91.767939] <1396> space_map_add(ffff880078d5c310, 342095872, 86528)

So apparently we're trying to add the same byte range to the free space map twice.

Those two zios come from spa_free_sync_cb():

[   91.751977] <1409> spa_free_sync_cb(): dva: off=342095872, size=86528
[   91.751997] <1409> spa_free_sync_cb(): dva: off=342095872, size=86528

And even more tracing:

[   91.277151] <1423> dbuf_dirty()
[   91.277155] <1423> dbuf_unoverride()
[   91.277158] <1423> zio_free(): dva: off=342095872, size=86528

[   91.284462] <1460> dbuf_dirty()
[   91.284466] <1460> dbuf_unoverride()
[   91.284469] <1460> zio_free(): dva: off=342095872, size=86528

So something in dbuf_dirty()/dbuf_unoverride() is trying to free the same DVA twice. At that point I ran out of ideas and turned on --enable-debug, which made things a lot easier:

[  115.828210] SPLError: 1797:0:(dbuf.c:807:dbuf_unoverride()) ASSERTION(dr->dt.dl.dr_override_state != DR_IN_DMU_SYNC) failed
[  115.828229] SPLError: 1797:0:(dbuf.c:807:dbuf_unoverride()) SPL PANIC
[  115.828235] SPL: Showing stack for process 1797
[  115.828240] Pid: 1797, comm: flush-zfs-1 Tainted: P           O 3.2.0-32-generic #51-Ubuntu
[  115.828244] Call Trace:
[  115.828261]  [<ffffffffa002d4d7>] spl_debug_dumpstack+0x27/0x40 [spl]
[  115.828269]  [<ffffffffa002e7c2>] spl_debug_bug+0x82/0xe0 [spl]
[  115.828314]  [<ffffffffa00e940a>] dbuf_unoverride+0x13a/0x240 [zfs]
[  115.828343]  [<ffffffffa00f220d>] dbuf_dirty+0x159d/0x15d0 [zfs]
[  115.828367]  [<ffffffffa00ee9a3>] ? dbuf_read+0x8f3/0xf20 [zfs]
[  115.828394]  [<ffffffffa00fb5ad>] ? dmu_buf_hold_array_by_dnode+0x2ad/0x830 [zfs]
[  115.828425]  [<ffffffffa01148dd>] ? dnode_rele+0x6d/0x150 [zfs]
[  115.828452]  [<ffffffffa00f3648>] dmu_buf_will_dirty+0xd8/0x1b0 [zfs]
[  115.828480]  [<ffffffffa00fbe55>] dmu_write+0x85/0x250 [zfs]
[  115.828533]  [<ffffffffa01b84ae>] zfs_putpage+0x1ee/0x380 [zfs]
[  115.828546]  [<ffffffff813100c0>] ? prio_tree_next+0xb0/0x1a0
[  115.828613]  [<ffffffffa01d51cc>] zpl_putpage+0x6c/0x160 [zfs]
[  115.828637]  [<ffffffff811216cd>] write_cache_pages+0x20d/0x460
[  115.828677]  [<ffffffffa01d5160>] ? zpl_readpage+0xc0/0xc0 [zfs]
[  115.828728]  [<ffffffffa01d5098>] zpl_writepages+0x18/0x20 [zfs]
[  115.828741]  [<ffffffff81122831>] do_writepages+0x21/0x40
[  115.828754]  [<ffffffff811a1440>] writeback_single_inode+0x180/0x430
[  115.828790]  [<ffffffff811a1b06>] writeback_sb_inodes+0x1b6/0x270
[  115.828827]  [<ffffffff811a1c5e>] __writeback_inodes_wb+0x9e/0xd0
[  115.828855]  [<ffffffff811a1f0b>] wb_writeback+0x27b/0x330
[  115.828871]  [<ffffffff810126ef>] ? __switch_to+0xff/0x360
[  115.828889]  [<ffffffff811935a2>] ? get_nr_dirty_inodes+0x52/0x80
[  115.828913]  [<ffffffff811a205f>] wb_check_old_data_flush+0x9f/0xb0
[  115.828930]  [<ffffffff811a2f51>] wb_do_writeback+0x151/0x1d0
[  115.828941]  [<ffffffff8100330e>] ? xen_end_context_switch+0x1e/0x30
[  115.828955]  [<ffffffff810767a0>] ? usleep_range+0x50/0x50
[  115.828975]  [<ffffffff811a3053>] bdi_writeback_thread+0x83/0x2a0
[  115.828992]  [<ffffffff8165b04e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[  115.829016]  [<ffffffff811a2fd0>] ? wb_do_writeback+0x1d0/0x1d0
[  115.829029]  [<ffffffff8108a0dc>] kthread+0x8c/0xa0
[  115.829046]  [<ffffffff816655b4>] kernel_thread_helper+0x4/0x10
[  115.829058]  [<ffffffff81663663>] ? int_ret_from_sys_call+0x7/0x1b
[  115.829078]  [<ffffffff8165b33c>] ? retint_restore_args+0x5/0x6
[  115.829108]  [<ffffffff816655b0>] ? gs_change+0x13/0x13

It looks like dbuf_dirty() is being called for a dbuf object that is currently being synced to disk by dmu_sync() but hasn't quite finished yet (i.e. hasn't reached dmu_sync_done()).

The patch at 86d68d3 appears to fix the problem for me. However I'm not entirely sure whether dropping db->db_mtx at that spot is safe. I think it is but I might need a second opinion on that. :)

@behlendorf
Contributor

@gunnarbeutner Nice job running this one to ground! But I think the right fix is to handle this up in zfs_putpage() by preventing the race in the first place. This is how the issue is handled in Illumos, and it was only a problem on Linux because I accidentally dropped these locks when rewriting the function for the Linux VFS.

Can you please review and test the patch d1532291ad77b6658d01b7df76030e8b8c8d51af? Without the patch applied I was able to hit the issue with your test case; with it applied I can't reproduce the crash anymore.

@gunnarbeutner
Contributor

The patch looks good. I've been running my test program in a loop for about an hour now and haven't seen any crashes. aptitude appears to be working as well.

@behlendorf
Contributor

In my automated testing I've noticed the patch hurts mmap'ed I/O performance more than I'd like. We'll need to think about this a little more before merging it; we may want to move the locking up to the broader zpl_writepages() function.

@behlendorf
Contributor

OK, I was able to investigate this further. The patch actually doesn't significantly impact performance, but what it does do is expose an unrelated condition variable issue which results in a hung task. Before I merge this change I want to run down this second issue.

behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 10, 2012
Gunnar Beutner did all the hard work on this one by correctly
identifying that this issue is a race between dmu_sync() and
dbuf_dirty().

Now in all cases the caller is responsible for preventing this
race by making sure the zfs_range_lock() is held when dirtying
a buffer which may be referenced in a log record.  The mmap
case which relies on zfs_putpage() was not taking the range
lock.  This code was accidentally dropped when the function
was rewritten for the Linux VFS.

This patch adds the required range locking to zfs_putpage().

It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
aren't strictly required due to the VFS holding a reference.
However, this makes the code more consistent with the upstream
code and there's no harm in being extra careful here.

Original-patch-by: Gunnar Beutner <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#541
@behlendorf
Contributor

It turns out the condition variable hang was introduced by the previous version of this patch, which accidentally created a lock inversion between the ZFS range lock and the VM's page lock. The updated patch which I just merged doesn't contain this flaw because the range lock is now acquired after the page lock is dropped.
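
For anyone less familiar with the term, here's a generic illustration of a lock inversion using pthreads; it is not the ZFS code, just the AB/BA ordering pattern the earlier version of the patch accidentally created between the two locks, and which the merged version avoids by only taking the range lock after the page lock has been dropped:

/* Build with: gcc -pthread inversion.c */
#include <pthread.h>

static pthread_mutex_t page_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread A takes page_lock, then range_lock. */
static void *thread_a(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&page_lock);
    pthread_mutex_lock(&range_lock);
    pthread_mutex_unlock(&range_lock);
    pthread_mutex_unlock(&page_lock);
    return NULL;
}

/* Thread B takes the same locks in the opposite (inverted) order. If A
 * holds page_lock while B holds range_lock, each waits forever on the
 * lock the other holds -- a deadlock. The fix is to make every path
 * acquire the locks in one agreed order. */
static void *thread_b(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&range_lock);
    pthread_mutex_lock(&page_lock);
    pthread_mutex_unlock(&page_lock);
    pthread_mutex_unlock(&range_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}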

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013