NULL deref in balance_pgdat() #287

Closed
behlendorf opened this issue Jun 20, 2011 · 34 comments

@behlendorf
Contributor

Reported on the zfs-discuss list by Tommy Cheng.


I am running a VM that is doing unrar via NFS to the host computer's
ZFS pool (both the rar files and the destination are on the ZFS pool).
I got this while unpacking 4+ GB of data.

[101511.985873] BUG: unable to handle kernel NULL pointer dereference at (null)
[101511.985924] IP: [] balance_pgdat+0x2c6/0x6d0
[101511.985952] PGD 4008f4067 PUD 4050cb067 PMD 0
[101511.985977] Oops: 0002 [#1] SMP
[101511.985994] last sysfs file: /sys/devices/system/cpu/cpu5/cpufreq/stats/total_trans
[101511.986015] CPU 2
[101511.986547]
[101511.986558] Pid: 39, comm: kswapd0 Tainted: P            2.6.38-8-server #42-Ubuntu Gigabyte Technology Co., Ltd. GA890FXA-UD5/GA-890FXA-UD5
[101511.986606] RIP: 0010:[]  [] balance_pgdat+0x2c6/0x6d0
[101511.986634] RSP: 0018:ffff88041b395da0  EFLAGS: 00010282
[101511.986649] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[101511.986668] RDX: 00000000002bbb93 RSI: 00000000000000d0 RDI: 000000000008d215
[101511.986687] RBP: ffff88041b395e90 R08: 0000000000000000 R09: 0000000000000000
[101511.986705] R10: 0000000000000064 R11: ffff8804084998f8 R12: 0000000000000000
[101511.986723] R13: ffff88043fffb000 R14: 0000000000000002 R15: ffff88043fffb000
[101511.986743] FS:  00007f7618e30700(0000) GS:ffff8800bfa80000(0000) knlGS:0000000000000000
[101511.986763] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[101511.986779] CR2: 0000000000000000 CR3: 0000000404e95000 CR4: 00000000000006e0
[101511.986798] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[101511.986817] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[101511.986838] Process kswapd0 (pid: 39, threadinfo ffff88041b394000, task ffff88041c83c4a0)
[101511.986857] Stack:
[101511.986866]  ffff88043fffbe00 ffff88041c83c4a0 0000000000000000 ffff88041b395e44
[101511.986895]  ffff88041b395eac 0000000000000000 0000000100000000 00000000002bbb93
[101511.986925]  0000000000000000 ffff88041b395e50 0000000000000000 0000000081087ae7
[101511.986954] Call Trace:
[101511.986970]  [] kswapd+0x143/0x1b0
[101511.986987]  [] ? kswapd+0x0/0x1b0
[101511.987005]  [] kthread+0x96/0xa0
[101511.987024]  [] kernel_thread_helper+0x4/0x10
[101511.987043]  [] ? kthread+0x0/0xa0
[101511.987060]  [] ? kernel_thread_helper+0x0/0x10
[101511.987304] RIP  [] balance_pgdat+0x2c6/0x6d0
[101511.987325]  RSP 
[101511.988015] CR2: 0000000000000000
[101511.990748] ---[ end trace 4491dd843977a02a ]---
@gunnarbeutner
Contributor

One thing that's missing from the dmesg output is an allocation failure that the user must have had shortly before the oops (SLUB: Unable to allocate memory on node -1 (gfp=0x20)).

This is just theoretical so far (I've only encountered this once, while copying some large files between two datasets), but the only way current->reclaim_state could end up being NULL there seems to be the following call chain:

kswapd -> balance_pgdat -> shrink_slab -> do_shrinker_shrink -> spl_kmem_cache_generic_shrinker -> spl_kmem_cache_reap_now -> spl_slab_reclaim -> some dtor func -> ??? (alloc func) -> alloc_pages_node / alloc_pages_exact_node -> __alloc_pages -> __alloc_pages_nodemask -> _alloc_pages??? -> get_page_from_freelist -> zone_reclaim -> __zone_reclaim (sets current->reclaim_state to NULL)
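
For context, the last step in that chain is what makes this fatal for kswapd: __zone_reclaim() installs its own on-stack reclaim_state and unconditionally clears the pointer on the way out. An abridged paraphrase of the 2.6.38-era mm/vmscan.c code (not an exact quote):

    static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
    {
            struct task_struct *p = current;
            struct reclaim_state reclaim_state;
            /* ... */
            reclaim_state.reclaimed_slab = 0;
            p->reclaim_state = &reclaim_state;   /* overwrites whatever the caller
                                                    (e.g. kswapd) had installed */
            /* ... shrink_zone() / slab shrinking ... */
            p->reclaim_state = NULL;             /* the caller's pointer is never restored */
            /* ... */
    }

So if an allocation made from kswapd's shrinker path ever lands here, kswapd returns to balance_pgdat() with current->reclaim_state == NULL.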

Adding a dump_stack() call right before the zone_reclaim() call in get_page_from_freelist() might make this easier to debug (requires recompiling the kernel though).
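
A minimal version of that instrumentation might look like this (hypothetical debug hunk, placed immediately before the existing zone_reclaim() call in get_page_from_freelist()):

    /* hypothetical: log whenever kswapd itself falls into zone reclaim */
    if (current_is_kswapd()) {
            printk(KERN_DEBUG "kswapd entering zone_reclaim()\n");
            dump_stack();
    }
    ret = zone_reclaim(zone, gfp_mask, order);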

@behlendorf
Contributor Author

Another way this could have happened is if get_current() returned a valid but incorrect address. In balance_pgdat() the only user of 'current' is the reclaim_state access, so if that occurred, that's exactly where you would expect to see the NULL deref.

I know that sounds like a stretch. But after digging into issue #279 a bit I'm beginning to wonder if we are hitting some sort of issue with the percpu counters. Both get_current() and get_next_timer_interrupt() (for x86_64) are built on top of percpu counters, and both of these issues could be explained if percpu_read_stable() were somehow returning a bogus address.

But I don't have any evidence of this yet, just a hunch.
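
For reference, on x86_64 kernels of this generation get_current() really is just a per-CPU variable read. Paraphrased from arch/x86/include/asm/current.h:

    DECLARE_PER_CPU(struct task_struct *, current_task);

    static __always_inline struct task_struct *get_current(void)
    {
            /* %gs-relative load of this CPU's current_task slot */
            return percpu_read_stable(current_task);
    }

    #define current get_current()

If the per-CPU base were ever wrong, 'current' would point at a valid-looking but incorrect task_struct, and its reclaim_state field could plausibly read back as NULL, producing exactly this oops.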

@gunnarbeutner
Contributor

This one was pretty tough to reproduce. Here's what finally worked:

  1. cp linux_2.6.38.orig.tar.gz /tank
  2. cd /tank; tar xfz linux_2.6.38.orig.tar.gz
  3. zfs create tank/a; zfs create tank/b; zfs create tank/c
  4. stress -m 8 --vm-hang 16 (in a separate vt)
  5. rm -Rf /tank/a/* /tank/b/* /tank/c/*
  6. cp -a /tank /tank/a/zzz/ & cp -a /tank /tank/b/zzz/ & cp -a /tank /tank/c/zzz/ &
  7. wait a couple of minutes; if it doesn't work: reboot and start again from step 4

I've added this in page_alloc.c:__alloc_pages_direct_reclaim (after try_to_free_pages()):

    if (current_is_kswapd()) {
            printk(KERN_DEBUG "Entering direct reclaim (clears current->reclaim_state)\n");
            dump_stack();
    }

and in vmscan.c:balance_pgdat (after shrink_slab()):

                    if (current->reclaim_state == NULL)
                            printk(KERN_DEBUG "current->reclaim_state == NULL after shrink_slab()\n");

Here's the debug output I got from that:

[  264.902436] Entering direct reclaim (clears current->reclaim_state)
[  264.902441] Pid: 46, comm: kswapd0 Tainted: P            2.6.38.2 #7
[  264.902443] Call Trace:
[  264.902449]  [<ffffffff8110f342>] ? __alloc_pages_nodemask+0x842/0x850
[  264.902454]  [<ffffffff8104d8f8>] ? hrtick_update+0x38/0x40
[  264.902457]  [<ffffffff81055457>] ? dequeue_task_fair+0x1e7/0x270
[  264.902460]  [<ffffffff81144285>] ? alloc_pages_current+0xa5/0x110
[  264.902464]  [<ffffffff8114d7c5>] ? new_slab+0x1f5/0x290
[  264.902466]  [<ffffffff8114f07d>] ? __slab_alloc+0x20d/0x460
[  264.902470]  [<ffffffff81037c49>] ? default_spin_lock_flags+0x9/0x10
[  264.902474]  [<ffffffff81109205>] ? mempool_alloc_slab+0x15/0x20
[  264.902477]  [<ffffffff8114f95f>] ? kmem_cache_alloc+0xff/0x110
[  264.902479]  [<ffffffff81109205>] ? mempool_alloc_slab+0x15/0x20
[  264.902482]  [<ffffffff81109543>] ? mempool_alloc+0x53/0x130
[  264.902490]  [<ffffffff810856b0>] ? autoremove_wake_function+0x0/0x40
[  264.902494]  [<ffffffff810d4765>] ? call_rcu_sched+0x15/0x20
[  264.902498]  [<ffffffff8119167e>] ? bio_alloc_bioset+0x3e/0xf0
[  264.902501]  [<ffffffff8113a560>] ? end_swap_bio_write+0x0/0x90
[  264.902503]  [<ffffffff811917d5>] ? bio_alloc+0x15/0x30
[  264.902506]  [<ffffffff8113a4e9>] ? get_swap_bio+0x29/0xa0
[  264.902508]  [<ffffffff8113a6c3>] ? swap_writepage+0x43/0xd0
[  264.902511]  [<ffffffff81115841>] ? pageout.clone.33+0x131/0x2f0
[  264.902514]  [<ffffffff8113a834>] ? __add_to_swap_cache+0x84/0xc0
[  264.902517]  [<ffffffff81117c66>] ? shrink_page_list+0x396/0x5c0
[  264.902520]  [<ffffffff8111827a>] ? shrink_inactive_list+0x11a/0x450
[  264.902523]  [<ffffffff81110ffa>] ? determine_dirtyable_memory+0x1a/0x30
[  264.902526]  [<ffffffff81118cd2>] ? shrink_zone+0x3c2/0x520
[  264.902541]  [<ffffffffa016045c>] ? rpcauth_cache_shrinker+0x1cc/0x220 [sunrpc]
[  264.902545]  [<ffffffff8111a025>] ? kswapd+0x735/0xa20
[  264.902548]  [<ffffffff811198f0>] ? kswapd+0x0/0xa20
[  264.902550]  [<ffffffff81084f66>] ? kthread+0x96/0xa0
[  264.902553]  [<ffffffff8100cce4>] ? kernel_thread_helper+0x4/0x10
[  264.902556]  [<ffffffff81084ed0>] ? kthread+0x0/0xa0
[  264.902558]  [<ffffffff8100cce0>] ? kernel_thread_helper+0x0/0x10
[  264.909755] Entering direct reclaim (clears current->reclaim_state)
[  264.909761] Pid: 46, comm: kswapd0 Tainted: P            2.6.38.2 #7
[  264.909763] Call Trace:
[  264.909770]  [<ffffffff8110f342>] ? __alloc_pages_nodemask+0x842/0x850
[  264.909776]  [<ffffffff8104d8f8>] ? hrtick_update+0x38/0x40
[  264.909779]  [<ffffffff81055457>] ? dequeue_task_fair+0x1e7/0x270
[  264.909784]  [<ffffffff81144285>] ? alloc_pages_current+0xa5/0x110
[  264.909789]  [<ffffffff8114d852>] ? new_slab+0x282/0x290
[  264.909794]  [<ffffffff8114f07d>] ? __slab_alloc+0x20d/0x460
[  264.909799]  [<ffffffff81037c49>] ? default_spin_lock_flags+0x9/0x10
[  264.909804]  [<ffffffff81109205>] ? mempool_alloc_slab+0x15/0x20
[  264.909808]  [<ffffffff8114f95f>] ? kmem_cache_alloc+0xff/0x110
[  264.909812]  [<ffffffff81109205>] ? mempool_alloc_slab+0x15/0x20
[  264.909814]  [<ffffffff81109543>] ? mempool_alloc+0x53/0x130
[  264.909818]  [<ffffffff810856b0>] ? autoremove_wake_function+0x0/0x40
[  264.909821]  [<ffffffff810d4765>] ? call_rcu_sched+0x15/0x20
[  264.909825]  [<ffffffff8119167e>] ? bio_alloc_bioset+0x3e/0xf0
[  264.909828]  [<ffffffff8113a560>] ? end_swap_bio_write+0x0/0x90
[  264.909831]  [<ffffffff811917d5>] ? bio_alloc+0x15/0x30
[  264.909833]  [<ffffffff8113a4e9>] ? get_swap_bio+0x29/0xa0
[  264.909835]  [<ffffffff8113a6c3>] ? swap_writepage+0x43/0xd0
[  264.909838]  [<ffffffff81115841>] ? pageout.clone.33+0x131/0x2f0
[  264.909841]  [<ffffffff8113a834>] ? __add_to_swap_cache+0x84/0xc0
[  264.909844]  [<ffffffff81117c66>] ? shrink_page_list+0x396/0x5c0
[  264.909847]  [<ffffffff8111827a>] ? shrink_inactive_list+0x11a/0x450
[  264.909850]  [<ffffffff81110ffa>] ? determine_dirtyable_memory+0x1a/0x30
[  264.909853]  [<ffffffff81118cd2>] ? shrink_zone+0x3c2/0x520
[  264.909863]  [<ffffffffa016045c>] ? rpcauth_cache_shrinker+0x1cc/0x220 [sunrpc]
[  264.909866]  [<ffffffff8111a025>] ? kswapd+0x735/0xa20
[  264.909869]  [<ffffffff811198f0>] ? kswapd+0x0/0xa20
[  264.909872]  [<ffffffff81084f66>] ? kthread+0x96/0xa0
[  264.909874]  [<ffffffff8100cce4>] ? kernel_thread_helper+0x4/0x10
[  264.909877]  [<ffffffff81084ed0>] ? kthread+0x0/0xa0
[  264.909879]  [<ffffffff8100cce0>] ? kernel_thread_helper+0x0/0x10
[  264.937809] Entering direct reclaim (clears current->reclaim_state)
[  264.937825] Pid: 46, comm: kswapd0 Tainted: P            2.6.38.2 #7
[  264.937825] Call Trace:
[  264.937829]  [<ffffffff8110f342>] ? __alloc_pages_nodemask+0x842/0x850
[  264.937845]  [<ffffffff8104d8f8>] ? hrtick_update+0x38/0x40
[  264.937859]  [<ffffffff81055457>] ? dequeue_task_fair+0x1e7/0x270
[  264.937859]  [<ffffffff81144285>] ? alloc_pages_current+0xa5/0x110
[  264.937868]  [<ffffffff8114d7c5>] ? new_slab+0x1f5/0x290
[  264.937875]  [<ffffffff8114f07d>] ? __slab_alloc+0x20d/0x460
[  264.937886]  [<ffffffff81037c49>] ? default_spin_lock_flags+0x9/0x10
[  264.937886]  [<ffffffff81109205>] ? mempool_alloc_slab+0x15/0x20
[  264.937899]  [<ffffffff8114f95f>] ? kmem_cache_alloc+0xff/0x110
[  264.937908]  [<ffffffff81109205>] ? mempool_alloc_slab+0x15/0x20
[  264.937909]  [<ffffffff81109543>] ? mempool_alloc+0x53/0x130
[  264.937909]  [<ffffffff810856b0>] ? autoremove_wake_function+0x0/0x40
[  264.937931]  [<ffffffff8119167e>] ? bio_alloc_bioset+0x3e/0xf0
[  264.937932]  [<ffffffff8113a560>] ? end_swap_bio_write+0x0/0x90
[  264.937954]  [<ffffffff811917d5>] ? bio_alloc+0x15/0x30
[  264.937961]  [<ffffffff8113a4e9>] ? get_swap_bio+0x29/0xa0
[  264.937965]  [<ffffffff8113a6c3>] ? swap_writepage+0x43/0xd0
[  264.937965]  [<ffffffff81115841>] ? pageout.clone.33+0x131/0x2f0
[  264.937984]  [<ffffffff8113a834>] ? __add_to_swap_cache+0x84/0xc0
[  264.937985]  [<ffffffff81117c66>] ? shrink_page_list+0x396/0x5c0
[  264.937996]  [<ffffffff8111827a>] ? shrink_inactive_list+0x11a/0x450
[  264.937996]  [<ffffffff81110ffa>] ? determine_dirtyable_memory+0x1a/0x30
[  264.938005]  [<ffffffff81118cd2>] ? shrink_zone+0x3c2/0x520
[  264.938081]  [<ffffffffa016045c>] ? rpcauth_cache_shrinker+0x1cc/0x220 [sunrpc]
[  264.938122]  [<ffffffff8111a025>] ? kswapd+0x735/0xa20
[  264.938127]  [<ffffffff811198f0>] ? kswapd+0x0/0xa20
[  264.938131]  [<ffffffff81084f66>] ? kthread+0x96/0xa0
[  264.938135]  [<ffffffff8100cce4>] ? kernel_thread_helper+0x4/0x10
[  264.938139]  [<ffffffff81084ed0>] ? kthread+0x0/0xa0
[  264.938141]  [<ffffffff8100cce0>] ? kernel_thread_helper+0x0/0x10
[  264.952551] Entering direct reclaim (clears current->reclaim_state)
[  264.952557] Pid: 46, comm: kswapd0 Tainted: P            2.6.38.2 #7
[  264.952560] Call Trace:
[  264.952567]  [<ffffffff8110f342>] ? __alloc_pages_nodemask+0x842/0x850
[  264.952573]  [<ffffffff8104d8f8>] ? hrtick_update+0x38/0x40
[  264.952577]  [<ffffffff81055457>] ? dequeue_task_fair+0x1e7/0x270
[  264.952581]  [<ffffffff81144285>] ? alloc_pages_current+0xa5/0x110
[  264.952586]  [<ffffffff8114d852>] ? new_slab+0x282/0x290
[  264.952590]  [<ffffffff8114f07d>] ? __slab_alloc+0x20d/0x460
[  264.952594]  [<ffffffff81037c49>] ? default_spin_lock_flags+0x9/0x10
[  264.952600]  [<ffffffff81109205>] ? mempool_alloc_slab+0x15/0x20
[  264.952604]  [<ffffffff8114f95f>] ? kmem_cache_alloc+0xff/0x110
[  264.952607]  [<ffffffff81109205>] ? mempool_alloc_slab+0x15/0x20
[  264.952611]  [<ffffffff81109543>] ? mempool_alloc+0x53/0x130
[  264.952717]  [<ffffffff810856b0>] ? autoremove_wake_function+0x0/0x40
[  264.952725]  [<ffffffff8119167e>] ? bio_alloc_bioset+0x3e/0xf0
[  264.952729]  [<ffffffff8113a560>] ? end_swap_bio_write+0x0/0x90
[  264.952733]  [<ffffffff811917d5>] ? bio_alloc+0x15/0x30
[  264.952736]  [<ffffffff8113a4e9>] ? get_swap_bio+0x29/0xa0
[  264.952739]  [<ffffffff8113a6c3>] ? swap_writepage+0x43/0xd0
[  264.952744]  [<ffffffff81115841>] ? pageout.clone.33+0x131/0x2f0
[  264.952747]  [<ffffffff8113a834>] ? __add_to_swap_cache+0x84/0xc0
[  264.952752]  [<ffffffff81117c66>] ? shrink_page_list+0x396/0x5c0
[  264.952756]  [<ffffffff8111827a>] ? shrink_inactive_list+0x11a/0x450
[  264.952760]  [<ffffffff81110ffa>] ? determine_dirtyable_memory+0x1a/0x30
[  264.952764]  [<ffffffff81118cd2>] ? shrink_zone+0x3c2/0x520
[  264.952777]  [<ffffffffa016045c>] ? rpcauth_cache_shrinker+0x1cc/0x220 [sunrpc]
[  264.952781]  [<ffffffff8111a025>] ? kswapd+0x735/0xa20
[  264.952785]  [<ffffffff811198f0>] ? kswapd+0x0/0xa20
[  264.952789]  [<ffffffff81084f66>] ? kthread+0x96/0xa0
[  264.952794]  [<ffffffff8100cce4>] ? kernel_thread_helper+0x4/0x10
[  264.952798]  [<ffffffff81084ed0>] ? kthread+0x0/0xa0
[  264.952801]  [<ffffffff8100cce0>] ? kernel_thread_helper+0x0/0x10
[  264.974650] current->reclaim_state == NULL after shrink_slab()
[  264.974659] current->reclaim_state == NULL after shrink_slab()
[  264.974664] current->reclaim_state == NULL after shrink_slab()
[  264.974669] current->reclaim_state == NULL after shrink_slab()
[  264.978708] current->reclaim_state == NULL after shrink_slab()
[  264.978716] current->reclaim_state == NULL after shrink_slab()
[  264.981520] current->reclaim_state == NULL after shrink_slab()
[  264.981527] current->reclaim_state == NULL after shrink_slab()
[  264.992643] current->reclaim_state == NULL after shrink_slab()
[  264.993213] current->reclaim_state == NULL after shrink_slab()
[  264.993219] current->reclaim_state == NULL after shrink_slab()
[  264.993225] current->reclaim_state == NULL after shrink_slab()
[  265.004076] current->reclaim_state == NULL after shrink_slab()
[  265.004084] current->reclaim_state == NULL after shrink_slab()
[  265.004091] current->reclaim_state == NULL after shrink_slab()
[  265.004097] current->reclaim_state == NULL after shrink_slab()
[  265.008254] current->reclaim_state == NULL after shrink_slab()
[  265.008260] current->reclaim_state == NULL after shrink_slab()
[  265.008824] current->reclaim_state == NULL after shrink_slab()
[  265.008849] current->reclaim_state == NULL after shrink_slab()
[  265.009007] current->reclaim_state == NULL after shrink_slab()
[  265.009012] current->reclaim_state == NULL after shrink_slab()
[  265.009394] current->reclaim_state == NULL after shrink_slab()
[  265.009414] current->reclaim_state == NULL after shrink_slab()
[  265.009418] current->reclaim_state == NULL after shrink_slab()
[  265.009422] current->reclaim_state == NULL after shrink_slab()
[  265.009554] current->reclaim_state == NULL after shrink_slab()
[  265.009560] current->reclaim_state == NULL after shrink_slab()
[  265.009688] current->reclaim_state == NULL after shrink_slab()
[  265.009694] current->reclaim_state == NULL after shrink_slab()
[  265.009698] current->reclaim_state == NULL after shrink_slab()
[  265.009702] current->reclaim_state == NULL after shrink_slab()
[  265.009844] current->reclaim_state == NULL after shrink_slab()
[  265.009850] current->reclaim_state == NULL after shrink_slab()
[  265.009992] current->reclaim_state == NULL after shrink_slab()
[  265.009999] current->reclaim_state == NULL after shrink_slab()
[  265.010003] current->reclaim_state == NULL after shrink_slab()
[  265.010006] current->reclaim_state == NULL after shrink_slab()
[  265.010137] current->reclaim_state == NULL after shrink_slab()
[  265.010143] current->reclaim_state == NULL after shrink_slab()
[  265.010147] current->reclaim_state == NULL after shrink_slab()
[  265.010151] current->reclaim_state == NULL after shrink_slab()
[  265.010271] current->reclaim_state == NULL after shrink_slab()
[  265.010278] current->reclaim_state == NULL after shrink_slab()
[  265.010281] current->reclaim_state == NULL after shrink_slab()
[  265.010334] current->reclaim_state == NULL after shrink_slab()
[  265.010395] current->reclaim_state == NULL after shrink_slab()
[  265.010399] current->reclaim_state == NULL after shrink_slab()
[  265.010477] current->reclaim_state == NULL after shrink_slab()
[  265.010483] current->reclaim_state == NULL after shrink_slab()
[  265.010487] current->reclaim_state == NULL after shrink_slab()
[  265.010491] current->reclaim_state == NULL after shrink_slab()
[  265.010620] current->reclaim_state == NULL after shrink_slab()
[  265.010626] current->reclaim_state == NULL after shrink_slab()
[  265.010630] current->reclaim_state == NULL after shrink_slab()
[  265.010683] current->reclaim_state == NULL after shrink_slab()
[  265.010689] current->reclaim_state == NULL after shrink_slab()
[  265.010693] current->reclaim_state == NULL after shrink_slab()
[  265.010823] current->reclaim_state == NULL after shrink_slab()
[  265.010830] current->reclaim_state == NULL after shrink_slab()
[  265.010833] current->reclaim_state == NULL after shrink_slab()
[  265.010837] current->reclaim_state == NULL after shrink_slab()
[  265.010841] current->reclaim_state == NULL after shrink_slab()
[  265.010896] current->reclaim_state == NULL after shrink_slab()
[  265.011405] current->reclaim_state == NULL after shrink_slab()
[  265.011637] current->reclaim_state == NULL after shrink_slab()
[  265.011695] current->reclaim_state == NULL after shrink_slab()
[  265.011772] current->reclaim_state == NULL after shrink_slab()
[  265.011779] current->reclaim_state == NULL after shrink_slab()
[  265.011782] current->reclaim_state == NULL after shrink_slab()
[  265.011786] current->reclaim_state == NULL after shrink_slab()
[  265.011790] current->reclaim_state == NULL after shrink_slab()
[  265.011889] current->reclaim_state == NULL after shrink_slab()
[  265.011894] current->reclaim_state == NULL after shrink_slab()
[  265.012071] current->reclaim_state == NULL after shrink_slab()
[  265.012078] current->reclaim_state == NULL after shrink_slab()
[  265.012262] current->reclaim_state == NULL after shrink_slab()
[  265.012269] current->reclaim_state == NULL after shrink_slab()
[  265.012600] current->reclaim_state == NULL after shrink_slab()
[  265.012630] current->reclaim_state == NULL after shrink_slab()
[  265.013685] current->reclaim_state == NULL after shrink_slab()
[  265.014011] current->reclaim_state == NULL after shrink_slab()
[  265.016479] current->reclaim_state == NULL after shrink_slab()
[  265.113866] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  265.113988] IP: [<ffffffff81119d3b>] kswapd+0x44b/0xa20
[  265.114041] PGD 7a03e067 PUD 7a03d067 PMD 0
[  265.114089] Oops: 0002 [#1] SMP
[  265.114131] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
[  265.114211] CPU 4
[  265.114221] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate vesafb nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ppdev snd_ens1371 vmw_balloon gameport snd_rawmidi psmouse snd_seq_device serio_raw snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 parport_pc lp shpchp parport e1000 mptspi floppy mptscsih mptbase
[  265.114882]
[  265.114915] Pid: 46, comm: kswapd0 Tainted: P            2.6.38.2 #7 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[  265.115022] RIP: 0010:[<ffffffff81119d3b>]  [<ffffffff81119d3b>] kswapd+0x44b/0xa20
[  265.115105] RSP: 0000:ffff88006be8fdc0  EFLAGS: 00010286
[  265.115151] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000c737
[  265.115204] RDX: 00000000000380bc RSI: 00000000000000d0 RDI: 000000000000b51b
[  265.115257] RBP: ffff88006be8fee0 R08: 0000000000000000 R09: 0000000000000000
[  265.115311] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  265.115364] R13: ffff88007feeb000 R14: 0000000000000001 R15: ffff88007feeb000
[  265.115417] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[  265.115497] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  265.116434] CR2: 0000000000000000 CR3: 000000007a03f000 CR4: 00000000000006e0
[  265.117366] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  265.118725] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  265.142302] Process kswapd0 (pid: 46, threadinfo ffff88006be8e000, task ffff88006be616e0)
[  265.144344] Stack:
[  265.145195]  ffff88007feeb700 ffff88006be616e0 ffff88006be8fe68 ffff88006be616e0
[  265.147045]  0000000000000000 ffff88007feeb700 ffff88006be616e0 ffff88006be8fe84
[  265.149887]  ffff88006be616e0 0000000000000000 0000000100000000 00000000000380bc
[  265.176895] Call Trace:
[  265.178057]  [<ffffffff811198f0>] ? kswapd+0x0/0xa20
[  265.179208]  [<ffffffff81084f66>] kthread+0x96/0xa0
[  265.203759]  [<ffffffff8100cce4>] kernel_thread_helper+0x4/0x10
[  265.204402]  [<ffffffff81084ed0>] ? kthread+0x0/0xa0
[  265.205137]  [<ffffffff8100cce0>] ? kernel_thread_helper+0x0/0x10
[  265.205945] Code: c0 44 89 f1 4c 89 ff 48 c1 e2 03 e8 80 1a ff ff 84 c0 0f 84 e8 02 00 00 48 8b 85 40 ff ff ff 48 8b 95 38 ff ff ff be d0 00 00 00 <48> c7 00 00 00 00 00 48 8b bd 70 ff ff ff e8 82 cc ff ff 48 8b
[  265.208156] RIP  [<ffffffff81119d3b>] kswapd+0x44b/0xa20
[  265.209156]  RSP <ffff88006be8fdc0>
[  265.210500] CR2: 0000000000000000
[  265.233368] ---[ end trace dc36242347b03186 ]---
[  292.408815] SPLError: 1310:0:(dmu_tx.c:1061:dmu_tx_assign()) ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) failed
[  292.412121] SPLError: 1310:0:(dmu_tx.c:1061:dmu_tx_assign()) SPL PANIC
[  292.413146] SPL: Showing stack for process 1310
[  292.437567] Pid: 1310, comm: txg_sync Tainted: P      D     2.6.38.2 #7
[  292.438921] Call Trace:
[  292.439893]  [<ffffffffa0224617>] ? spl_debug_dumpstack+0x27/0x40 [spl]
[  292.440974]  [<ffffffffa02259c1>] ? spl_debug_bug+0x81/0xe0 [spl]
[  292.442071]  [<ffffffffa03085c7>] ? dmu_tx_assign+0x607/0x660 [zfs]
[  292.443138]  [<ffffffffa0307e19>] ? dmu_tx_hold_bonus+0x49/0xa0 [zfs]
[  292.499144]  [<ffffffffa0309a7d>] ? dmu_tx_hold_sa+0x19d/0x280 [zfs]
[  292.499165]  [<ffffffffa031ac46>] ? dsl_dataset_prev_snap_txg+0x36/0x70 [zfs]
[  292.499198]  [<ffffffffa03a1898>] ? zfs_inactive+0x118/0x2f0 [zfs]
[  292.499224]  [<ffffffffa03a46c5>] ? zfs_inode_destroy+0xb5/0x110 [zfs]
[  292.499250]  [<ffffffffa03bd30f>] ? zpl_evict_inode+0x2f/0x40 [zfs]
[  292.499254]  [<ffffffff81178544>] ? evict+0x24/0xc0
[  292.499257]  [<ffffffff81178679>] ? dispose_list+0x59/0xf0
[  292.499260]  [<ffffffff81179463>] ? shrink_icache_memory+0x1c3/0x340
[  292.499264]  [<ffffffff81116aec>] ? shrink_slab+0x11c/0x180
[  292.499267]  [<ffffffff811193f6>] ? do_try_to_free_pages+0x316/0x420
[  292.499270]  [<ffffffff81119775>] ? try_to_free_pages+0x85/0x100
[  292.499273]  [<ffffffff8110ef84>] ? __alloc_pages_nodemask+0x484/0x850
[  292.499278]  [<ffffffff81144285>] ? alloc_pages_current+0xa5/0x110
[  292.499282]  [<ffffffff8114d852>] ? new_slab+0x282/0x290
[  292.499285]  [<ffffffff8114f07d>] ? __slab_alloc+0x20d/0x460
[  292.499296]  [<ffffffffa022901b>] ? kmem_alloc_debug+0x1fb/0x3a0 [spl]
[  292.499300]  [<ffffffff8114fa82>] ? __kmalloc+0x112/0x140
[  292.499306]  [<ffffffffa022901b>] ? kmem_alloc_debug+0x1fb/0x3a0 [spl]
[  292.499310]  [<ffffffff8105d872>] ? default_wake_function+0x12/0x20
[  292.499314]  [<ffffffff8104a5e9>] ? __wake_up_common+0x59/0x90
[  292.499329]  [<ffffffffa02e9520>] ? dbuf_write_ready+0x0/0x480 [zfs]
[  292.499472]  [<ffffffffa02ed980>] ? dbuf_write_done+0x0/0xa40 [zfs]
[  292.499488]  [<ffffffffa02e4761>] ? arc_write+0xa1/0x290 [zfs]
[  292.499516]  [<ffffffffa02fc946>] ? dmu_write_policy+0xf6/0x2c0 [zfs]
[  292.499532]  [<ffffffffa02ebaf8>] ? dbuf_write.clone.2+0x208/0x720 [zfs]
[  292.499547]  [<ffffffffa02e9520>] ? dbuf_write_ready+0x0/0x480 [zfs]
[  292.499562]  [<ffffffffa02ed980>] ? dbuf_write_done+0x0/0xa40 [zfs]
[  292.499577]  [<ffffffffa02f3910>] ? dbuf_sync_leaf+0x140/0xa50 [zfs]
[  292.499592]  [<ffffffffa02f4ac2>] ? dbuf_sync_indirect+0x1e2/0x4a0 [zfs]
[  292.499608]  [<ffffffffa02f4868>] ? dbuf_sync_list+0x98/0x110 [zfs]
[  292.499629]  [<ffffffffa0317050>] ? dnode_sync+0x850/0x1e80 [zfs]
[  292.499648]  [<ffffffffa030f6b1>] ? dnode_verify+0x2f1/0xa00 [zfs]
[  292.499668]  [<ffffffffa0317050>] ? dnode_sync+0x850/0x1e80 [zfs]
[  292.499692]  [<ffffffffa0340096>] ? refcount_add+0x16/0x20 [zfs]
[  292.499710]  [<ffffffffa0300504>] ? dmu_objset_sync_dnodes+0xa4/0x290 [zfs]
[  292.499728]  [<ffffffffa0300d2d>] ? dmu_objset_sync+0x1bd/0x5b0 [zfs]
[  292.499744]  [<ffffffffa02fdf70>] ? dmu_objset_write_ready+0x0/0x160 [zfs]
[  292.499761]  [<ffffffffa02fe710>] ? dmu_objset_write_done+0x0/0x170 [zfs]
[  292.499782]  [<ffffffffa031a630>] ? dsl_dataset_sync+0x80/0x170 [zfs]
[  292.499803]  [<ffffffffa032e9fd>] ? dsl_pool_sync+0x12d/0x7f0 [zfs]
[  292.499828]  [<ffffffffa034a1e3>] ? spa_sync+0x3c3/0xc00 [zfs]
[  292.499832]  [<ffffffff8104a5e9>] ? __wake_up_common+0x59/0x90
[  292.499857]  [<ffffffffa035da06>] ? txg_sync_thread+0x2c6/0x500 [zfs]
[  292.499861]  [<ffffffff810137a9>] ? sched_clock+0x9/0x10
[  292.499865]  [<ffffffff8108bfad>] ? sched_clock_cpu+0xbd/0x110
[  292.499889]  [<ffffffffa035d740>] ? txg_sync_thread+0x0/0x500 [zfs]
[  292.499896]  [<ffffffffa022db91>] ? thread_generic_wrapper+0x81/0xe0 [spl]
[  292.499903]  [<ffffffffa022db10>] ? thread_generic_wrapper+0x0/0xe0 [spl]
[  292.499907]  [<ffffffff81084f66>] ? kthread+0x96/0xa0
[  292.499910]  [<ffffffff8100cce4>] ? kernel_thread_helper+0x4/0x10
[  292.499913]  [<ffffffff81084ed0>] ? kthread+0x0/0xa0
[  292.499916]  [<ffffffff8100cce0>] ? kernel_thread_helper+0x0/0x10
[  292.812151] SPL: Dumping log to /tmp/spl-log.1308823642.1310

@tommycheng

Hit this bug again. Similar workload on the server.

Jun 24 03:03:44 celia kernel: [50910.652096] BUG: unable to handle kernel NULL pointer dereference at           (null)
Jun 24 03:03:44 celia kernel: [50910.652147] IP: [<ffffffff8111eb66>] balance_pgdat+0x2c6/0x6d0
Jun 24 03:03:44 celia kernel: [50910.652176] PGD 41f08e067 PUD 3eeba6067 PMD 0 
Jun 24 03:03:44 celia kernel: [50910.652200] Oops: 0002 [#1] SMP 
Jun 24 03:03:44 celia kernel: [50910.652217] last sysfs file: /sys/devices/system/cpu/cpu5/cpufreq/stats/total_trans
Jun 24 03:03:44 celia kernel: [50910.652239] CPU 2 
Jun 24 03:03:44 celia kernel: [50910.652247] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables kvm_amd kvm nfsd exportfs nfs lockd parport_pc fscache nfs_acl auth_rpcgss ppdev sunrpc twofish_generic twofish_x86_64 twofish_common serpent xts gf128mul snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat fat snd_hda_intel pl2303 bridge stp edac_core edac_mce_amd k10temp sp5100_tco i2c_piix4 usbserial snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc bonding lp parport sha256_generic cryptd aes_x86_64 aes_generic dm_crypt raid10 raid456 async_pq async_xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath linear dm_raid45 xor btrfs zlib_deflate libcrc32c usb_storage nouveau uas ttm drm_kms_helper ahci drm sata_mv r8169 pata_atiixp libahci i2c_algo_bit video xhci_h
Jun 24 03:03:44 celia kernel: cd pata_jmicron sata_sil24
Jun 24 03:03:44 celia kernel: [50910.652760] 
Jun 24 03:03:44 celia kernel: [50910.652771] Pid: 39, comm: kswapd0 Tainted: P            2.6.38-8-server #42-Ubuntu Gigabyte Technology Co., Ltd. GA-890FXA-UD5/GA-890FXA-UD5
Jun 24 03:03:44 celia kernel: [50910.653446] RIP: 0010:[<ffffffff8111eb66>]  [<ffffffff8111eb66>] balance_pgdat+0x2c6/0x6d0
Jun 24 03:03:44 celia kernel: [50910.654092] RSP: 0018:ffff88041b36fda0  EFLAGS: 00010282
Jun 24 03:03:44 celia kernel: [50910.654730] RAX: 00000000000002ed RBX: 0000000000000000 RCX: 0000000000000000
Jun 24 03:03:44 celia kernel: [50910.655376] RDX: 0000000000297c2c RSI: 00000000000000d0 RDI: 0000000000085d56
Jun 24 03:03:44 celia kernel: [50910.656010] RBP: ffff88041b36fe90 R08: 0000000000000000 R09: 0000000000000000
Jun 24 03:03:44 celia kernel: [50910.656651] R10: 0000000000000004 R11: 0000000000000001 R12: 0000000000000000
Jun 24 03:03:44 celia kernel: [50910.657287] R13: ffff88043fffb000 R14: 0000000000000002 R15: ffff88043fffb000
Jun 24 03:03:44 celia kernel: [50910.657926] FS:  00007fc46ac1d700(0000) GS:ffff8800bfa80000(0000) knlGS:0000000000000000
Jun 24 03:03:44 celia kernel: [50910.658568] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 24 03:03:44 celia kernel: [50910.659210] CR2: 0000000000000000 CR3: 00000003efd67000 CR4: 00000000000006e0
Jun 24 03:03:44 celia kernel: [50910.659856] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 24 03:03:44 celia kernel: [50910.660501] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 24 03:03:44 celia kernel: [50910.661150] Process kswapd0 (pid: 39, threadinfo ffff88041b36e000, task ffff88041c83c4a0)
Jun 24 03:03:44 celia kernel: [50910.661794] Stack:
Jun 24 03:03:44 celia kernel: [50910.662047]  ffff88041b36fe20 ffff88041c83c4a0 ffff88041b36fdd0 ffff88041b36fe44
Jun 24 03:03:44 celia kernel: [50910.662047]  ffff88041b36feac 0000000000000000 0000000100000000 0000000000297c2c
Jun 24 03:03:44 celia kernel: [50910.662047]  0000000000000000 ffff88041b36fe50 0000000000000000 000000003fffb000
Jun 24 03:03:44 celia kernel: [50910.662047] Call Trace:
Jun 24 03:03:44 celia kernel: [50910.662047]  [<ffffffff8111f0b3>] kswapd+0x143/0x1b0
Jun 24 03:03:44 celia kernel: [50910.662047]  [<ffffffff8111ef70>] ? kswapd+0x0/0x1b0
Jun 24 03:03:44 celia kernel: [50910.662047]  [<ffffffff810871f6>] kthread+0x96/0xa0
Jun 24 03:03:44 celia kernel: [50910.662047]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
Jun 24 03:03:44 celia kernel: [50910.662047]  [<ffffffff81087160>] ? kthread+0x0/0xa0
Jun 24 03:03:44 celia kernel: [50910.662047]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
Jun 24 03:03:44 celia kernel: [50910.662047] Code: c0 44 89 f1 4c 89 ff 48 c1 e2 03 e8 55 1b ff ff 84 c0 0f 84 d5 02 00 00 48 8b 8d 50 ff ff ff 48 8b 95 48 ff ff ff be d0 00 00 00 <48> c7 01 00 00 00 00 48 8b 7d 80 e8 fa cf ff ff 48 8b 8d 50 ff 
Jun 24 03:03:44 celia kernel: [50910.662047] RIP  [<ffffffff8111eb66>] balance_pgdat+0x2c6/0x6d0
Jun 24 03:03:44 celia kernel: [50910.662047]  RSP <ffff88041b36fda0>
Jun 24 03:03:44 celia kernel: [50910.662047] CR2: 0000000000000000
Jun 24 03:03:44 celia kernel: [50910.674427] ---[ end trace 97259487999c0130 ]---

@rripley

rripley commented Jun 29, 2011

I'm having the same problem while loading files from a ZFS filesystem to an OpenAFS volume also on ZFS.

[14432.856175] BUG: unable to handle kernel NULL pointer dereference at           (null)
[14432.856195] IP: [<ffffffff8111da96>] balance_pgdat+0x2c6/0x6d0
[14432.856208] PGD 1b12d7067 PUD 1b12d6067 PMD 0
[14432.856216] Oops: 0002 [#1] SMP
[14432.856221] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
[14432.856229] CPU 1
[14432.856231] Modules linked in: openafs(P) zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate acpiphp
[14432.856248]
[14432.856252] Pid: 26, comm: kswapd0 Tainted: P            2.6.38-8-virtual #42-Ubuntu
[14432.856260] RIP: e030:[<ffffffff8111da96>]  [<ffffffff8111da96>] balance_pgdat+0x2c6/0x6d0
[14432.856269] RSP: e02b:ffff8801ce7e3da0  EFLAGS: 00010282
[14432.856273] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[14432.856279] RDX: 00000000000547b1 RSI: 00000000000000d0 RDI: 0000000000011557
[14432.856284] RBP: ffff8801ce7e3e90 R08: 0000000000000000 R09: 0000000000000000
[14432.856290] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[14432.856295] R13: ffff8801dfffb000 R14: 0000000000000002 R15: ffff8801dfffb000
[14432.856304] FS:  00007f665a384720(0000) GS:ffff8801dff90000(0000) knlGS:0000000000000000
[14432.856310] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[14432.856315] CR2: 0000000000000000 CR3: 00000001b12d9000 CR4: 0000000000002660
[14432.856321] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[14432.856327] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[14432.856332] Process kswapd0 (pid: 26, threadinfo ffff8801ce7e2000, task ffff8801ce7d96c0)
[14432.856338] Stack:
[14432.856341]  0000000000000001 ffff8801ce7d96c0 0000000000000000 ffff8801ce7e3e44
[14432.856350]  ffff8801ce7e3eac 0000000000000000 0000000100000000 00000000000547b1
[14432.856359]  0000000000000000 ffff8801ce7e3e50 0000000000000000 0000000081087397
[14432.856368] Call Trace:
[14432.856374]  [<ffffffff8111dfd3>] kswapd+0x133/0x1a0
[14432.856379]  [<ffffffff8111dea0>] ? kswapd+0x0/0x1a0
[14432.872866]  [<ffffffff81086ac6>] kthread+0x96/0xa0
[14432.872866]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
[14432.872866]  [<ffffffff815dc621>] ? retint_restore_args+0x5/0x6
[14432.872866]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
[14432.872866] Code: c0 44 89 f1 4c 89 ff 48 c1 e2 03 e8 65 1b ff ff 84 c0 0f 84 d5 02 00 00 48 8b 8d 50 ff ff ff 48 8b 95 48 ff ff ff be d0 00 00 00 <48> c7 01 00 00 00 00 48 8b 7d 80 e8 fa cf ff ff 48 8b 8d 50 ff
[14432.872866] RIP  [<ffffffff8111da96>] balance_pgdat+0x2c6/0x6d0
[14432.872866]  RSP <ffff8801ce7e3da0>
[14432.872866] CR2: 0000000000000000
[14432.883056] ---[ end trace 106d4dd43ddcd9ee ]---

@behlendorf
Contributor Author

I was able to recreate this issue once under RHEL 6.1 with xfstests test 224, which completely fills the filesystem as a stress/correctness test.

@tomharvey

Got this while rsyncing from a 4-disk RAIDZ volume to an ext4 volume. It transferred around 200 GB (around 8,000 files) before my system locked up, with a server load average (1 min) of 32 at the time.

The server is a 4-core Xeon X3430 with 8 GB of RAM.

Sep 5 16:12:43 server_name kernel: [394279.482080] BUG: unable to handle kernel NULL pointer dereference at (null)
Sep 5 16:12:43 server_name kernel: [394279.482368] IP: [] balance_pgdat+0x50c/0x6d0
Sep 5 16:12:43 server_name kernel: [394279.482646] PGD 0
Sep 5 16:12:43 server_name kernel: [394279.482918] Oops: 0002 [#1] SMP
Sep 5 16:12:43 server_name kernel: [394279.483310] last sysfs file: /sys/devices/pci0000:00/0000:00:1c.7/0000:06:00.0/local_cpus
Sep 5 16:12:43 server_name kernel: [394279.483753] CPU 2
Sep 5 16:12:43 server_name kernel: [394279.484195] Modules linked in: sha1_generic arc4 ppp_mppe ppp_async crc_ccitt zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables fbcon tileblit font bitblit softcursor vga16fb vgastate snd_hda_intel lp snd_hda_codec parport snd_hwdep nouveau snd_pcm snd_timer ttm serio_raw drm_kms_helper snd drm soundcore psmouse i2c_algo_bit snd_page_alloc raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 mpt2sas multipath scsi_transport_sas ahci linear e1000e
Sep 5 16:12:43 server_name kernel: [394279.486354] Pid: 52, comm: kswapd0 Tainted: P 2.6.32-28-server #55-Ubuntu System Product Name
Sep 5 16:12:43 server_name kernel: [394279.486921] RIP: 0010:[] [] balance_pgdat+0x50c/0x6d0
Sep 5 16:12:43 server_name kernel: [394279.487487] RSP: 0018:ffff880235247d60 EFLAGS: 00010286
Sep 5 16:12:43 server_name kernel: [394279.488053] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000001d604
Sep 5 16:12:43 server_name kernel: [394279.488622] RDX: 0000000000002aba RSI: 00000000000000d0 RDI: 0000000000001bbf
Sep 5 16:12:43 server_name kernel: [394279.489190] RBP: ffff880235247e50 R08: 0000000000000002 R09: 0000000000000000
Sep 5 16:12:43 server_name kernel: [394279.489752] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
Sep 5 16:12:43 server_name kernel: [394279.490317] R13: ffff880000018000 R14: 0000000000000000 R15: 0000000000000002
Sep 5 16:12:43 server_name kernel: [394279.490881] FS: 0000000000000000(0000) GS:ffff880008e40000(0000) knlGS:0000000000000000
Sep 5 16:12:43 server_name kernel: [394279.491444] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Sep 5 16:12:43 server_name kernel: [394279.492000] CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006e0
Sep 5 16:12:43 server_name kernel: [394279.492569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 5 16:12:43 server_name kernel: [394279.493136] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 5 16:12:43 server_name kernel: [394279.493694] Process kswapd0 (pid: 52, threadinfo ffff880235246000, task ffff88023633ade0)
Sep 5 16:12:43 server_name kernel: [394279.494250] Stack:
Sep 5 16:12:43 server_name kernel: [394279.494803] ffff880235247da0 ffff880235247dc0 0000000000000000 ffff88023633ade0
Sep 5 16:12:43 server_name kernel: [394279.494817] <0> ffff880235247e14 0000000000011de0 0000000000082314 0000000000000000
Sep 5 16:12:43 server_name kernel: [394279.495386] <0> 0000000000000000 00000000815591e8 ffff880000018000 0000000c00000001
Sep 5 16:12:43 server_name kernel: [394279.496526] Call Trace:
Sep 5 16:12:43 server_name kernel: [394279.497101] [] ? isolate_pages_global+0x0/0x50
Sep 5 16:12:43 server_name kernel: [394279.497675] [] kswapd+0xfe/0x150
Sep 5 16:12:43 server_name kernel: [394279.498251] [] ? autoremove_wake_function+0x0/0x40
Sep 5 16:12:43 server_name kernel: [394279.498831] [] ? kswapd+0x0/0x150
Sep 5 16:12:43 server_name kernel: [394279.499409] [] kthread+0x96/0xa0
Sep 5 16:12:43 server_name kernel: [394279.499987] [] child_rip+0xa/0x20
Sep 5 16:12:43 server_name kernel: [394279.500560] [] ? kthread+0x0/0xa0
Sep 5 16:12:43 server_name kernel: [394279.501124] [] ? child_rip+0x0/0x20
Sep 5 16:12:43 server_name kernel: [394279.501677] Code: f9 48 8d 44 98 01 48 c1 e0 04 4a 8b 14 28 48 c1 e2 03 e8 08 4e ff ff 85 c0 0f 84 40 01 00 00 48 8b 85 48 ff ff ff be d0 00 00 00 <48> c7 00 00 00 00 00 48 8b 95 40 ff ff ff 48 8b bd 70 ff ff ff
Sep 5 16:12:43 server_name kernel: [394279.502921] RIP [] balance_pgdat+0x50c/0x6d0
Sep 5 16:12:43 server_name kernel: [394279.503494] RSP
Sep 5 16:12:43 server_name kernel: [394279.504062] CR2: 0000000000000000
Sep 5 16:12:43 server_name kernel: [394279.506027] ---[ end trace 4328978de521155c ]---

@Rudd-O
Contributor

Rudd-O commented Sep 8, 2011

Same problem here:

[17978.602553] BUG: unable to handle kernel NULL pointer dereference at (null)
[17978.602617] IP: [] balance_pgdat+0x2d3/0x574
[17978.602674] PGD 36601067 PUD 2d528067 PMD 0
[17978.602716] Oops: 0002 [#1] SMP
[17978.602751] CPU 1
[17978.602769] Modules linked in: tcp_lp fuse ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM bridge stp llc tun nfs lockd fscache auth_rpcgss nfs_acl sunrpc inet_diag bnep bluetooth xt_physdev xt_state ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_mangle hwmon_vid coretemp uvcvideo videodev snd_hda_codec_conexant media v4l2_compat_ioctl32 snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device arc4 snd_pcm ath9k mac80211 ath9k_common ath9k_hw ath snd_timer sparse_keymap cfg80211 iTCO_wdt snd joydev i2c_i801 soundcore iTCO_vendor_support rfkill snd_page_alloc atl1c cpufreq_ondemand acpi_cpufreq freq_table mperf zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
[17978.603481]
[17978.603481] Pid: 26, comm: kswapd0 Tainted: P 2.6.40-4.fc15.x86_64 #1 TOSHIBA Satellite C655/Portable PC
[17978.603481] RIP: 0010:[] [] balance_pgdat+0x2d3/0x574
[17978.603481] RSP: 0018:ffff8800b1f37d50 EFLAGS: 00010286
[17978.603481] RAX: 0000000000000000 RBX: ffff8800b7cd3000 RCX: 000000000003eb97
[17978.603481] RDX: 000000000008a6a9 RSI: 0000000000000064 RDI: ffff8800b1f37de8
[17978.603481] RBP: ffff8800b1f37e30 R08: 0000000000000001 R09: 0000000000000000
[17978.603481] R10: 0000000000000080 R11: 0000000000000080 R12: 0000000000000000
[17978.603481] R13: 0000000000000001 R14: 0000000000000000 R15: 000000000000000c
[17978.603481] FS: 0000000000000000(0000) GS:ffff8800b7880000(0000) knlGS:0000000000000000
[17978.603481] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[17978.603481] CR2: 0000000000000000 CR3: 0000000036600000 CR4: 00000000000406e0
[17978.603481] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17978.603481] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17978.603481] Process kswapd0 (pid: 26, threadinfo ffff8800b1f36000, task ffff8800b2b25cc0)
[17978.603481] Stack:
[17978.603481] ffff8800b1f37d90 0000000000000000 0000000100a99730 000000000008a6a9
[17978.603481] ffff8800b1f37eac 0000000000000000 ffff8800b7cd3000 0000000000000000
[17978.603481] ffff8800b1f37db0 0000000000000020 0000000000000000 ffffffffffffffff
[17978.603481] Call Trace:
[17978.603481] [] kswapd+0x2b4/0x2f0
[17978.603481] [] ? remove_wait_queue+0x3a/0x3a
[17978.603481] [] ? balance_pgdat+0x574/0x574
[17978.603481] [] kthread+0x84/0x8c
[17978.603481] [] kernel_thread_helper+0x4/0x10
[17978.603481] [] ? kthread_worker_fn+0x148/0x148
[17978.603481] [] ? gs_change+0x13/0x13
[17978.603481] Code: 43 ff ff 84 c0 75 7d 48 8d 95 68 ff ff ff 48 89 de 44 89 ff e8 bf ee ff ff 48 8b 85 28 ff ff ff 48 8b 95 38 ff ff ff 48 8d 7d b8
[17978.603481] c7 00 00 00 00 00 48 8b b5 68 ff ff ff e8 ec d3 ff ff 48 8b
[17978.603481] RIP [] balance_pgdat+0x2d3/0x574
[17978.603481] RSP
[17978.603481] CR2: 0000000000000000
[17978.665053] ---[ end trace b667bb698cb08150 ]---


Max ARC size is set to 256 MB. It happens when I am using more than ten Chrome tabs and memory use is high.

@speed47
Contributor

speed47 commented Oct 27, 2011

I'm hitting this bug as well.

Hardware/software: Atom 330, 2 GB RAM, Ubuntu 11.10 Server 32-bit, but with the Debian 64-bit-kernel-on-32-bit-userland kernel (3.0.0-2-amd64). ZFS modules 0.6.0-rc6, built from Ubuntu's PPA (0.6.0.34-0ubuntu1~oneiric1).

zpool config:

    NAME        STATE     READ WRITE CKSUM
    zfs         ONLINE       0     0     0
      sda2      ONLINE       0     0     0
      sdb2      ONLINE       0     0     0

Operation: rsync from an ext4 partition on one drive (sdc) to a ZFS mountpoint; source and destination are both local.
Memory usage was reasonable when the bug happened (< 1 GB), and I don't use any swap space.

2011-10-24T06:50:16.493310+02:00 nas kernel: [111275.349045] BUG: unable to handle kernel NULL pointer dereference at           (null)
2011-10-24T06:50:16.493388+02:00 nas kernel: [111275.349060] IP: [<ffffffff810c493b>] balance_pgdat+0x2a0/0x514
2011-10-24T06:50:16.493403+02:00 nas kernel: [111275.349079] PGD 61d44067 PUD 194f9067 PMD 0 
2011-10-24T06:50:16.493416+02:00 nas kernel: [111275.349090] Oops: 0002 [#1] SMP 
2011-10-24T06:50:16.493430+02:00 nas kernel: [111275.349099] CPU 0 
2011-10-24T06:50:16.493479+02:00 nas kernel: [111275.349103] Modules linked in: xfs sha256_generic cryptd aes_x86_64 aes_generic cbc xt_multiport xt_comment nf_conntrack_ipv6 nf_defrag_ipv6 xt_state xt_tcpudp nls_utf8 nls_cp437 vfat fat nf_conntrack_sane ip6table_security ip6table_filter ip6table_mangle ip6table_raw ip6_tables iptable_security iptable_filter iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_raw ip_tables x_tables parport_pc ppdev binfmt_misc dm_crypt nvidia(P) zfs(P) zcommon(P) znvpair(P) zavl(P) snd_hda_codec_hdmi zunicode(P) snd_hda_codec_realtek spl(O) snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event dm_multipath snd_seq dm_mod scsi_dh snd_timer snd_seq_device psmouse serio_raw evdev snd soundcore wmi shpchp snd_page_alloc pci_hotplug asus_atk0110 processor i2c_nforce2 thermal_sys coretemp smsc47m192 hwmon_vid i2c_i801 i2c_core lp parport loop ext4 mbcache jbd2 crc16 raid10 raid456 async_raid6_recov asy
2011-10-24T06:50:16.493503+02:00 nas kernel: nc_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod btrfs zlib_deflate crc32c libcrc32c usbhid hid sg sd_mod crc_t10dif ata_generic usb_storage uas ahci libahci libata ehci_hcd ohci_hcd r8169 scsi_mod mii usbcore button
2011-10-24T06:50:16.493518+02:00 nas kernel: [111275.349317] 
2011-10-24T06:50:16.493534+02:00 nas kernel: [111275.349324] Pid: 32, comm: kswapd0 Tainted: P           O 3.0.0-2-amd64 #1 System manufacturer System Product Name/AT3IONT-I
2011-10-24T06:50:16.493552+02:00 nas kernel: [111275.349336] RIP: 0010:[<ffffffff810c493b>]  [<ffffffff810c493b>] balance_pgdat+0x2a0/0x514
2011-10-24T06:50:16.493565+02:00 nas kernel: [111275.349348] RSP: 0018:ffff88006a6d1d50  EFLAGS: 00010282
2011-10-24T06:50:16.493602+02:00 nas kernel: [111275.349354] RAX: 0000000000000000 RBX: ffff88006ff87000 RCX: 000000000000a915
2011-10-24T06:50:16.493619+02:00 nas kernel: [111275.349360] RDX: 000000000002c4d8 RSI: 0000000000000064 RDI: ffff88006a6d1df0
2011-10-24T06:50:16.493635+02:00 nas kernel: [111275.349366] RBP: ffff88006ff87000 R08: 0000000000000001 R09: 0000000000000000
2011-10-24T06:50:16.493648+02:00 nas kernel: [111275.349372] R10: 0000000000000001 R11: 00000000000000b9 R12: 0000000000000001
2011-10-24T06:50:16.493663+02:00 nas kernel: [111275.349378] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
2011-10-24T06:50:16.493678+02:00 nas kernel: [111275.349386] FS:  0000000000000000(0000) GS:ffff88006fc00000(0000) knlGS:0000000000000000
2011-10-24T06:50:16.493691+02:00 nas kernel: [111275.349392] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2011-10-24T06:50:16.493706+02:00 nas kernel: [111275.349398] CR2: 0000000000000000 CR3: 00000000194f5000 CR4: 00000000000006f0
2011-10-24T06:50:16.493721+02:00 nas kernel: [111275.349405] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2011-10-24T06:50:16.493736+02:00 nas kernel: [111275.349411] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2011-10-24T06:50:16.493766+02:00 nas kernel: [111275.349418] Process kswapd0 (pid: 32, threadinfo ffff88006a6d0000, task ffff88006d2f8930)
2011-10-24T06:50:16.493780+02:00 nas kernel: [111275.349423] Stack:
2011-10-24T06:50:16.493795+02:00 nas kernel: [111275.349427]  0000000000000000 0000000000000000 ffff88006a6d1eac 000000008103b011
2011-10-24T06:50:16.493811+02:00 nas kernel: [111275.349440]  000000000002c4d8 000000016fc12800 0000000000000000 00ff88000000000c
2011-10-24T06:50:16.493827+02:00 nas kernel: [111275.349451]  ffffffff81038390 ffff88006a6d1de0 0000000000000000 0000000000000000
2011-10-24T06:50:16.493839+02:00 nas kernel: [111275.349463] Call Trace:
2011-10-24T06:50:16.493855+02:00 nas kernel: [111275.349476]  [<ffffffff81038390>] ? need_resched+0x1a/0x23
2011-10-24T06:50:16.493872+02:00 nas kernel: [111275.349486]  [<ffffffff810c4e63>] ? kswapd+0x2b4/0x2cd
2011-10-24T06:50:16.493887+02:00 nas kernel: [111275.349496]  [<ffffffff8105ff4b>] ? wake_up_bit+0x23/0x23
2011-10-24T06:50:16.493917+02:00 nas kernel: [111275.349505]  [<ffffffff810c4baf>] ? balance_pgdat+0x514/0x514
2011-10-24T06:50:16.493932+02:00 nas kernel: [111275.349514]  [<ffffffff810c4baf>] ? balance_pgdat+0x514/0x514
2011-10-24T06:50:16.493949+02:00 nas kernel: [111275.349522]  [<ffffffff8105fad3>] ? kthread+0x7a/0x82
2011-10-24T06:50:16.493964+02:00 nas kernel: [111275.349533]  [<ffffffff81338864>] ? kernel_thread_helper+0x4/0x10
2011-10-24T06:50:16.493978+02:00 nas kernel: [111275.349542]  [<ffffffff8105fa59>] ? kthread_worker_fn+0x149/0x149
2011-10-24T06:50:16.493992+02:00 nas kernel: [111275.349551]  [<ffffffff81338860>] ? gs_change+0x13/0x13
2011-10-24T06:50:16.494009+02:00 nas kernel: [111275.349555] Code: 1a 47 ff ff 84 c0 75 70 8b 7c 24 38 48 8d 54 24 50 48 89 de e8 8c ef ff ff 48 8b 44 24 30 48 8b 54 24 20 48 8d bc 24 a0 00 00 00 
2011-10-24T06:50:16.494028+02:00 nas kernel: [111275.349645] RIP  [<ffffffff810c493b>] balance_pgdat+0x2a0/0x514
2011-10-24T06:50:16.494040+02:00 nas kernel: [111275.349655]  RSP <ffff88006a6d1d50>
2011-10-24T06:50:16.494074+02:00 nas kernel: [111275.349659] CR2: 0000000000000000
2011-10-24T06:50:16.494123+02:00 nas kernel: [111275.349666] ---[ end trace a61f6bfa47c2ea02 ]---

@dechamps
Contributor

I got hit by this on Debian Unstable with the standard Debian 3.0.0-2-amd64 kernel, zfs 0.6.0-rc6, spl 0.6.0-rc6:

[88777.938822] BUG: unable to handle kernel NULL pointer dereference at           (null)
[88777.939455] IP: [<ffffffff810c560b>] balance_pgdat+0x2a0/0x514
[88777.940080] PGD 0
[88777.940695] Oops: 0002 [#1] SMP
[88777.941316] CPU 6
[88777.941321] Modules linked in: fuse sr_mod cdrom usb_storage uas ext4 mbcache jbd2 crc16 sha256_generic dm_crypt sg acpi_cpufreq mperf cpufreq_conservative cpufreq_userspace cpufreq_powersave cpufreq_stats softdog sch_prio sch_htb aesni_intel cryptd aes_x86_64 aes_generic 8021q garp bridge stp xt_DSCP xt_CLASSIFY iptable_mangle ipt_REDIRECT ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 xt_addrtype ipt_REJECT xt_owner xt_tcpudp iptable_filter ip_tables x_tables nls_utf8 nls_cp437 vfat fat snd_hda_codec_hdmi snd_hda_codec_realtek arc4 ath9k mac80211 snd_hda_intel ath9k_common ath9k_hw snd_hda_codec ath i915 snd_hwdep cfg80211 drm_kms_helper drm eeepc_wmi snd_pcm asus_wmi snd_timer r8169 psmouse sparse_keymap mxm_wmi snd i2c_algo_bit pci_hotplug evdev e1000e rfkill soundcore pcspkr serio_raw i2c_i801 mii i2c_core wmi processor ath3k video thermal_sys snd_page_alloc button zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl(O) zlib_deflate sd_mod crc_t10dif ehci_hcd ahci libahci firewire_ohci libata firewire_core crc_itu_t xhci_hcd scsi_mod usbcore dm_mirror dm_region_hash dm_log dm_mod
[88777.948582]
[88777.949361] Pid: 45, comm: kswapd0 Tainted: P           O 3.0.0-2-amd64 #1 System manufacturer System Product Name/P8Z68-V PRO
[88777.950978] RIP: 0010:[<ffffffff810c560b>]  [<ffffffff810c560b>] balance_pgdat+0x2a0/0x514
[88777.951811] RSP: 0018:ffff880419e15d50  EFLAGS: 00010282
[88777.952640] RAX: 0000000000000000 RBX: ffff88042fdfb000 RCX: 000000000006298e
[88777.953485] RDX: 00000000001b5e29 RSI: 0000000000000064 RDI: ffff880419e15df0
[88777.954327] RBP: ffff88042fdfb000 R08: 0000000000000001 R09: 0000000000000000
[88777.955167] R10: 0000000000000002 R11: 0000000000000028 R12: 0000000000000001
[88777.955996] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[88777.956818] FS:  0000000000000000(0000) GS:ffff88042fb80000(0000) knlGS:0000000000000000
[88777.957643] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[88777.958461] CR2: 0000000000000000 CR3: 0000000001603000 CR4: 00000000000406e0
[88777.959276] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[88777.960094] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[88777.960910] Process kswapd0 (pid: 45, threadinfo ffff880419e14000, task ffff880419e13850)
[88777.961733] Stack:
[88777.962547]  0000000000000000 0000000000000000 ffff880419e15eac 000000008103b0a2
[88777.963383]  00000000001b5e29 000000012fa12800 0000000000000000 00ff88040000000c
[88777.964207]  0000000000000006 ffff880419e15de0 0000000000000000 0000000000000000
[88777.965017] Call Trace:
[88777.965810]  [<ffffffff810c5b33>] ? kswapd+0x2b4/0x2cd
[88777.966596]  [<ffffffff8106023f>] ? wake_up_bit+0x23/0x23
[88777.967365]  [<ffffffff810c587f>] ? balance_pgdat+0x514/0x514
[88777.968120]  [<ffffffff810c587f>] ? balance_pgdat+0x514/0x514
[88777.968852]  [<ffffffff8105fdc7>] ? kthread+0x7a/0x82
[88777.969575]  [<ffffffff8133d1e4>] ? kernel_thread_helper+0x4/0x10
[88777.970292]  [<ffffffff8105fd4d>] ? kthread_worker_fn+0x149/0x149
[88777.971003]  [<ffffffff8133d1e0>] ? gs_change+0x13/0x13
[88777.971715] Code: f6 46 ff ff 84 c0 75 70 8b 7c 24 38 48 8d 54 24 50 48 89 de e8 8c ef ff ff 48 8b 44 24 30 48 8b 54 24 20 48 8d bc 24 a0 00 00 00
[88777.972513]  c7 00 00 00 00 00 48 8b 74 24 50 e8 90 d6 ff ff 48 8b 4c 24
[88777.974060] RIP  [<ffffffff810c560b>] balance_pgdat+0x2a0/0x514
[88777.974835]  RSP <ffff880419e15d50>
[88777.975596] CR2: 0000000000000000
[88777.978688] ---[ end trace 60239cd77040da02 ]---

I got this after ~1 day of uptime (24.66 hours, according to the log). There was no particular workload; the pool must have been mostly idle when this happened.

From past experience I get the feeling that the crash happens after some fixed amount of work, meaning it will crash quickly if you stress the pool continuously but survive longer if you're not using it much.

@cshei

cshei commented Oct 29, 2011

Just encountered this as well with 0.6.0-rc6:

[11768.290396] BUG: unable to handle kernel NULL pointer dereference at (null)
[11768.290427] IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
[11768.290450] PGD 20de6c067 PUD 1fa9d5067 PMD 0
[11768.290467] Oops: 0002 [#1] SMP
[11768.290481] CPU 0
[11768.290488] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zl sock vmmon vmhgfs vmci i2c_i801 rtc_cmos
[11768.290545]
[11768.290552] Pid: 668, comm: kswapd0 Tainted: P 3.0.6-gentoo #1 LENOVO 4286CTO/4286CTO
[11768.290580] RIP: 0010:[<ffffffff8109c7ab>] [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
[11768.290604] RSP: 0018:ffff880214b39d40 EFLAGS: 00010282
[11768.290620] RAX: 0000000000000000 RBX: ffff88021e5fb000 RCX: 0000000000026d3c
[11768.290640] RDX: 000000000015774b RSI: 0000000000000064 RDI: 00000000000234c3
[11768.290658] RBP: ffff88021e5fb000 R08: 0000000000000001 R09: 0000000000000000
[11768.290677] R10: 0000000000000004 R11: 0000000000000001 R12: 0000000000000002
[11768.290696] R13: 0000000000000002 R14: 0000000000000000 R15: 000000000000000c
[11768.290716] FS: 0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
[11768.290737] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[11768.290753] CR2: 0000000000000000 CR3: 000000020e0af000 CR4: 00000000000406f0
[11768.290772] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11768.290791] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[11768.290810] Process kswapd0 (pid: 668, threadinfo ffff880214b38000, task ffff880214b0e090)
[11768.290832] Stack:
[11768.290838]  0000000000000e00 ffff880200000000 0000000000000000 0000000000000000
[11768.290863]  ffff880214b39eac 0000000014b39dd0 000000000015774b 0000000100aed717
[11768.290887]  0000000000000000 00ffffff810453a4 0000000000000006 ffff880214b0e090
[11768.290911] Call Trace:
[11768.290921]  [<ffffffff8109ccae>] ? kswapd+0x25f/0x278
[11768.290938]  [<ffffffff81051433>] ? wake_up_bit+0x23/0x23
[11768.290954]  [<ffffffff8109ca4f>] ? balance_pgdat+0x4ff/0x4ff
[11768.290970]  [<ffffffff8109ca4f>] ? balance_pgdat+0x4ff/0x4ff
[11768.290994]  [<ffffffff81051023>] ? kthread+0x7a/0x82
[11768.291017]  [<ffffffff8151f054>] ? kernel_thread_helper+0x4/0x10
[11768.291043]  [<ffffffff81050fa9>] ? kthread_worker_fn+0x135/0x135
[11768.291069]  [<ffffffff8151f050>] ? gs_change+0xb/0xb
[11768.291090] Code: d0 45 31 c0 48 03 55 10 e8 e9 6a ff ff 84 c0 75 74 48 8d 54 24 60 48 89 ee 44 89 ff e8 44 f2 ff ff 48 8b 44 24 40 48 8b 54 24 30
[11768.291190]  c7 00 00 00 00 00 48 8d bc 24 b0 00 00 00 48 8b 74 24 60 e8
[11768.291239] RIP [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
[11768.291257] RSP <ffff880214b39d40>
[11768.291268] CR2: 0000000000000000
[11768.401666] ---[ end trace f66ba14c95741335 ]---

It was scrubbing a zpool and running a Windows 7 VMWare VM (from the same zpool) at the time.

@dechamps
Copy link
Contributor

OK, I have some more info. Inspired by gunnarbeutner's work, I noticed the following lines in __alloc_pages_direct_reclaim():

current->reclaim_state = &reclaim_state;
*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
current->reclaim_state = NULL;

What's interesting is that the kernel doesn't check whether current->reclaim_state is already set before doing this. Meaning that if there is a valid current->reclaim_state when we enter __alloc_pages_direct_reclaim(), all hell breaks loose: the new reclaim_state erases the old one, and then the pointer is set to NULL, which would cause this bug. After that, the old reclaim_state is lost in the wild, which can't be good.

So I added the following debug statements before the aforementioned code:

if (current->reclaim_state) {
    printk(KERN_WARNING "entering __alloc_pages_direct_reclaim with a non-NULL reclaim_state... this sounds wrong!\n");
    dump_stack();
}

Then I waited. Several hours later, the debug message appeared in the log, and immediately afterwards the NULL dereference occurred. What's very interesting is that, contrary to gunnarbeutner, I have ZFS functions in the stack trace!

[25219.619148] entering __alloc_pages_direct_reclaim with a non-NULL reclaim_state... this sounds wrong!
[25219.619757] Pid: 45, comm: kswapd0 Tainted: P           O 3.0.0-2-amd64 #1
[25219.620381] Call Trace:
[25219.620992]  [<ffffffff810bcd41>] ? __alloc_pages_nodemask+0x4d9/0x7c1
[25219.621609]  [<ffffffff810ec474>] ? kmem_getpages+0x55/0xf0
[25219.622218]  [<ffffffff810ecc2e>] ? fallback_alloc+0x129/0x1c1
[25219.622845]  [<ffffffffa0073ba6>] ? kmem_alloc_debug+0x73/0xcf [spl]
[25219.623475]  [<ffffffff810ed31b>] ? __kmalloc+0x95/0xf0
[25219.624065]  [<ffffffffa0073ba6>] ? kmem_alloc_debug+0x73/0xcf [spl]
[25219.624639]  [<ffffffff8103840a>] ? should_resched+0x5/0x24
[25219.625215]  [<ffffffffa02084d8>] ? dbuf_dirty+0x1c3/0x553 [zfs]
[25219.625786]  [<ffffffff810720f0>] ? arch_local_irq_save+0x14/0x1d
[25219.626364]  [<ffffffffa0233588>] ? sa_attr_op+0xd9/0x30d [zfs]
[25219.626941]  [<ffffffffa0224b68>] ? dsl_dir_tempreserve_space+0x188/0x1b3 [zfs]
[25219.627531]  [<ffffffffa0233817>] ? sa_bulk_update_impl+0x5b/0x8e [zfs]
[25219.628114]  [<ffffffff810383fc>] ? need_resched+0x1a/0x23
[25219.628678]  [<ffffffff8103840a>] ? should_resched+0x5/0x24
[25219.629249]  [<ffffffffa023392b>] ? sa_update+0x44/0x5a [zfs]
[25219.629828]  [<ffffffffa0266f40>] ? zfs_inactive+0xdf/0x145 [zfs]
[25219.630396]  [<ffffffff8103840a>] ? should_resched+0x5/0x24
[25219.630962]  [<ffffffff8110ede2>] ? evict+0x7b/0x119
[25219.631543]  [<ffffffff8110eeb2>] ? dispose_list+0x32/0x3c
[25219.632085]  [<ffffffff8110f324>] ? shrink_icache_memory+0x2a4/0x2d4
[25219.632609]  [<ffffffff810c2deb>] ? shrink_slab+0xe3/0x155
[25219.633129]  [<ffffffff810c571f>] ? balance_pgdat+0x2d1/0x57a
[25219.633648]  [<ffffffff810c5c7c>] ? kswapd+0x2b4/0x2cd
[25219.634170]  [<ffffffff8106023f>] ? wake_up_bit+0x23/0x23
[25219.634743]  [<ffffffff810c59c8>] ? balance_pgdat+0x57a/0x57a
[25219.635347]  [<ffffffff810c59c8>] ? balance_pgdat+0x57a/0x57a
[25219.635880]  [<ffffffff8105fdc7>] ? kthread+0x7a/0x82
[25219.636435]  [<ffffffff8133d324>] ? kernel_thread_helper+0x4/0x10
[25219.636973]  [<ffffffff8105fd4d>] ? kthread_worker_fn+0x149/0x149
[25219.637514]  [<ffffffff8133d320>] ? gs_change+0x13/0x13
[25219.675551] in balance_pgdat(), current->reclaim_state is NULL after shrink_slab(). Prepare to die!

@dechamps
Copy link
Contributor

In txg_sync_thread(), I noticed this very interesting code:

/*
 * Disable the normal reclaim path for the txg_sync thread.  This
 * ensures the thread will never enter dmu_tx_assign() which can
 * otherwise occur due to direct reclaim.  If this is allowed to
 * happen the system can deadlock.  Direct reclaim call path:
 *
 *   ->shrink_icache_memory->prune_icache->dispose_list->
 *   clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
 */
current->flags |= PF_MEMALLOC;

This seems targeted at fixing this kind of issue… except it doesn't: we see zfs_inactive → sa_update in the stack trace, meaning that zfs_inactive successfully called dmu_tx_assign although it is running inside a reclaim path. Then sa_update needs memory, and triggers direct reclaim… which erases reclaim_state, and after that, the system is just a ticking bomb waiting to explode.

@dechamps
Copy link
Contributor

Just to check, I added this to txg_sync_thread at the beginning of the main loop:

#ifdef _KERNEL
    if (!(current->flags & PF_MEMALLOC))
        printk(KERN_WARNING "in txg_sync_thread() main loop, we don't have the PF_MEMALLOC flag set!\n");
#endif /* _KERNEL */

Results are clear, right from the boot:

[   32.744189] in txg_sync_thread() main loop, we don't have the PF_MEMALLOC flag set!
[   37.462484] in txg_sync_thread() main loop, we don't have the PF_MEMALLOC flag set!
[   42.466333] in txg_sync_thread() main loop, we don't have the PF_MEMALLOC flag set!
[   47.423092] in txg_sync_thread() main loop, we don't have the PF_MEMALLOC flag set!
[   52.466460] in txg_sync_thread() main loop, we don't have the PF_MEMALLOC flag set!
[   57.490777] in txg_sync_thread() main loop, we don't have the PF_MEMALLOC flag set!
[   62.427296] in txg_sync_thread() main loop, we don't have the PF_MEMALLOC flag set!

Well, that explains it.

@dechamps
Copy link
Contributor

As shown in my stack trace above, in the path leading up to the direct reclaim, this is called as part of zfs_inactive():

sa_update(zp->z_sa_hdl, SA_ZPL_ATIME(zsb), (void *)&zp->z_atime, sizeof (zp->z_atime), tx);

In light of this, I'm thinking that disabling the atime feature might be a workaround for this issue. This assumes no other code path triggers allocation within a reclaim, which is a bit of a stretch, but it's worth a try so that affected users get some breathing room.

So, assuming I'm right, this workaround should stop the balance_pgdat NULL deref issue from happening, or at least decrease its frequency:

zfs set atime=off <zpool>

I would be glad if some of you could test it and report back.

@speed47
Copy link
Contributor

speed47 commented Oct 30, 2011

Nice digging work!
Will try it on my setup right now; my config seems to be very good at hitting this bug for some reason...

@behlendorf
Copy link
Contributor Author

Nice legwork on this. I think you've clearly identified the issue: we are trashing current->reclaim_state. Exactly how this occurs is still a little less obvious to me, but I'll carefully walk the stack provided to see if it sheds some light on things. In the meanwhile, let me comment on your above observations:

In txg_sync_thread(), I noticed this very interesting code: ...

This change eec8164 was made to explicitly prevent a deadlock in the txg_sync_thread. This thread will hold the txg_sync lock while flushing dirty data to disk, if it were to ever attempt to reacquire this lock it would deadlock on itself. In fact, because the Linux VFS layer differs significantly from Solaris VFS this is entirely possible via the direct memory reclaim path described. Any memory allocation could potentially cause this, to avoid this issue entirely direct reclaim is disabled for the txg_sync_thread.

Your stack above is able to call zfs_inactive()->...->dmu_tx_assign() because it is being done in the context of the kswapd thread. This should be safe because kswapd won't be holding the txg_sync lock, however it does appear that somehow in the call stack the current->reclaim_state was damaged.

Your second observation about PF_MEMALLOC somehow getting cleared in txg_sync_thread() is troubling for two reasons. First off, this should be impossible since __alloc_pages_slowpath() checks for this pretty early on and aborts reclaim if it's set. Perhaps there is another call path where it can be cleared?

        /* Avoid recursion of direct reclaim */
        if (current->flags & PF_MEMALLOC)
                goto nopage;

And secondly, because prior to this fix it was quite easy to deadlock the txg_sync_thread. There's every indication that it is actually working as designed; are you absolutely sure you got this check correct?

Anyway, I think you've still hit on the heart of the issue. We need to determine exactly where things go wrong with the current->reclaim_state management.

@dechamps
Copy link
Contributor

dechamps commented Nov 1, 2011

I think you've clearly identified the issue: we are trashing current->reclaim_state. Exactly how this occurs is still a little less obvious to me

Anyway, I think you've still hit on the heart of the issue. We need to determine exactly where things go wrong with the current->reclaim_state management.

It seems pretty clear to me. Let me summarize my findings:

  • balance_pgdat(), part of the asynchronous reclaim path in kswapd, uses current->reclaim_state.
  • The direct (synchronous) reclaim path resets current->reclaim_state. More precisely, __alloc_pages_direct_reclaim erases it with its own structure.

Conclusion: if you trigger a direct reclaim while you're already reclaiming (asynchronously) in kswapd, then the reclaim_state used by balance_pgdat() gets corrupted by __alloc_pages_direct_reclaim which replaces it with its own. After that, as one might expect, the system goes down fast.
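
To make the failure sequence concrete, here is the rough timeline as a commented sketch (call sites simplified, and exact details vary by kernel version; this is my reading of the findings above, not literal kernel source):

/*
 * In the kswapd thread:
 *
 * kswapd()
 *   current->reclaim_state = &reclaim_state;         // kswapd's own state
 *   balance_pgdat()
 *     shrink_slab()
 *       ... ZFS eviction path allocates memory ...
 *         __alloc_pages_direct_reclaim()
 *           current->reclaim_state = &reclaim_state; // clobbers kswapd's state
 *           try_to_free_pages(...);
 *           current->reclaim_state = NULL;           // kswapd's state is gone
 *   balance_pgdat()                                  // next pass
 *     reclaim_state = current->reclaim_state;        // now NULL
 *     reclaim_state->reclaimed_slab = 0;             // NULL write -> the oops
 */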

As a personal note, it boggles my mind that __alloc_pages_direct_reclaim just resets current->reclaim_state without even an assertion or something to make sure that it's not doing something horribly wrong. Maybe we should make this a suggestion to the kernel developers: indeed, having an assertion there would have allowed us to debug this much faster, since we would have more useful stacktraces.
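
For instance, something like this in __alloc_pages_direct_reclaim() (a sketch of the suggested assertion, not actual kernel code) would have turned the silent corruption into an immediate, well-located failure:

/* Hypothetical assertion: catch the clobbering at its source. */
BUG_ON(current->reclaim_state != NULL);
current->reclaim_state = &reclaim_state;
*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
current->reclaim_state = NULL;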

Moving on. There seems to be one predominant path where ZFS has a significant probability of entering direct reclaim in kswapd: allocations needed for atime updates triggered by inode eviction. This is quite visible in the aforementioned stacktrace:

  • __alloc_pages_direct_reclaim (corruption of reclaim_state)
  • kmem_alloc_debug (SPL allocation function, which unfortunately triggers direct reclaim)
  • dbuf_dirty (memory-hungry ZFS function)
  • sa_update (atime update)
  • zfs_inactive (triggers the atime update)
  • zpl_evict_inode (evict_inode superblock operation, called from the kernel)
  • evict (kernel inode eviction function)
  • shrink_slab (kernel reclaiming function)
  • balance_pgdat (kernel higher-level reclaiming function)
  • kswapd

Although atime updates are probably not the only case where ZFS allocates memory inside kswapd, they seem to be the cause of most (if not all) crashes related to this issue. Indeed, ZFS apparently needs relatively large amounts of memory to perform the updates, most probably because of dbuf_dirty, which needs to allocate a dbuf to hold the metadata.

The random nature of this issue comes from the fact that the system only crashes if a direct reclaim is triggered by ZFS inside kswapd, which only happens if there is not enough memory available to satisfy allocation requests from ZFS during eviction (see stack trace). Most of the time there is still some memory available and/or ZFS doesn't need much memory to do the job, so direct reclaim isn't used and everything is fine. If not, well, congratulations, you've just lost at Russian roulette.

rsync is particularly good at triggering this issue because it deals with huge numbers of files. Consequently, ZFS has to update atime information for a lot of files, and the probability that direct reclaim will be (fatally) used increases. In theory, however, this could happen at any moment, making any system a ticking bomb.

It's still too soon to tell, but it seems my system doesn't suffer from this issue anymore after disabling atime on my pool. If nothing happens in the next 48 hours then I'll take it as a confirmation of my diagnosis.

In the meantime, I added some printks to the kmem_alloc_debug function which dump a stack trace if an allocation is requested while reclaim_state is set. As I said before, such allocations won't necessarily crash the system, but there is a small chance that one of them will, so it's important information. With atime enabled I got spammed by reports. With atime disabled I only get this stack trace from time to time (so far):

[60180.748176] in kmem_alloc_debug() from task_alloc:100 but we are reclaiming!
Pid: 45, comm: kswapd0 Tainted: P           O 3.0.0-2-amd64 #1
[60180.749751] Call Trace:
[60180.750529]  <IRQ>  [<ffffffffa0072baa>] ? kmem_alloc_debug+0x40/0xf3 [spl]
[60180.751353]  [<ffffffffa0074435>] ? task_alloc+0xe1/0x139 [spl]
[60180.752164]  [<ffffffffa02b60d7>] ? zio_decompress+0x46/0x46 [zfs]
[60180.752968]  [<ffffffffa0074be4>] ? __taskq_dispatch+0xb8/0x150 [spl]
[60180.753780]  [<ffffffffa02b570d>] ? zio_taskq_dispatch+0x95/0xa5 [zfs]
[60180.754593]  [<ffffffffa0289e20>] ? vdev_disk_dio_put+0x4d/0x54 [zfs]
[60180.755396]  [<ffffffffa028a27c>] ? vdev_disk_physio_completion+0x94/0xb7 [zfs]
[60180.756197]  [<ffffffff811b73cb>] ? is_swiotlb_buffer+0x10/0x31
[60180.757013]  [<ffffffff81192604>] ? blk_update_request+0x18f/0x30e
[60180.757826]  [<ffffffffa016f158>] ? ata_scsi_qc_complete+0x2ec/0x302 [libata]
[60180.758628]  [<ffffffff8119279d>] ? blk_update_bidi_request+0x1a/0x67
[60180.759432]  [<ffffffff811928d1>] ? blk_end_bidi_request+0x19/0x57
[60180.760264]  [<ffffffffa00d195c>] ? scsi_io_completion+0x1cb/0x4cf [scsi_mod]
[60180.761079]  [<ffffffff81336f5c>] ? _raw_spin_lock_irqsave+0x9/0x25
[60180.761897]  [<ffffffff811966f0>] ? blk_done_softirq+0x67/0x76
[60180.762723]  [<ffffffff8104be44>] ? __do_softirq+0xb9/0x178
[60180.763548]  [<ffffffff8133d41c>] ? call_softirq+0x1c/0x30
[60180.764375]  [<ffffffff8100a9ef>] ? do_softirq+0x3f/0x84
[60180.765187]  [<ffffffff8104c0b0>] ? irq_exit+0x3f/0xa3
[60180.766025]  [<ffffffff8100a71b>] ? do_IRQ+0x85/0x9e
[60180.766935]  [<ffffffff81337313>] ? common_interrupt+0x13/0x13
[60180.767935]  <EOI>  [<ffffffff810383fc>] ? need_resched+0x1a/0x23
[60180.768958]  [<ffffffff811acd09>] ? __write_lock_failed+0x9/0x20
[60180.769769]  [<ffffffff81072113>] ? arch_write_lock+0xe/0xf
[60180.770384]  [<ffffffff810ddbd0>] ? remove_vm_area+0x2b/0x78
[60180.770964]  [<ffffffff810ddc6c>] ? __vunmap+0x32/0xb2
[60180.771532]  [<ffffffffa0073516>] ? spl_slab_reclaim+0x1c5/0x1eb [spl]
[60180.772101]  [<ffffffffa007358d>] ? spl_kmem_cache_reap_now+0x51/0x69 [spl]
[60180.772674]  [<ffffffffa0073805>] ? __spl_kmem_cache_generic_shrinker.clone.9+0x34/0x79 [spl]
[60180.773255]  [<ffffffff810c2deb>] ? shrink_slab+0xe3/0x155
[60180.773834]  [<ffffffff810c571f>] ? balance_pgdat+0x2d1/0x57a
[60180.774444]  [<ffffffff810c5c7c>] ? kswapd+0x2b4/0x2cd
[60180.775049]  [<ffffffff8106023f>] ? wake_up_bit+0x23/0x23
[60180.775634]  [<ffffffff810c59c8>] ? balance_pgdat+0x57a/0x57a
[60180.776214]  [<ffffffff810c59c8>] ? balance_pgdat+0x57a/0x57a
[60180.776786]  [<ffffffff8105fdc7>] ? kthread+0x7a/0x82
[60180.777350]  [<ffffffff8133d324>] ? kernel_thread_helper+0x4/0x10
[60180.777911]  [<ffffffff8105fd4d>] ? kthread_worker_fn+0x149/0x149
[60180.778497]  [<ffffffff8133d320>] ? gs_change+0x13/0x13

IMO, the moral of the story is: don't try to allocate memory in a thread which has reclaim_state set (meaning: kswapd). That being said, I'm not sure removing all allocations from the ZFS inode eviction path is practically feasible, but if we could defer metadata (atime) updates until after the reclaiming is done (or to another thread), it would certainly be a step in the right direction. I'll let Brian decide on the best course of action regarding this general problem.
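
For illustration, such a deferral could look roughly like the sketch below. This is purely hypothetical: zfs_atime_taskq and zfs_atime_update_task are made-up names, and a real implementation would also have to avoid allocating the dispatch argument itself inside the reclaim path.

/* Hypothetical sketch: push the atime sa_update() out of the reclaim
 * context into a taskq thread, which is free to allocate memory. */
static void
zfs_atime_update_task(void *arg)
{
	znode_t *zp = arg;

	/* The dmu_tx_assign() + sa_update() of z_atime would run here,
	 * in a taskq thread instead of under kswapd. */
}

/* In zfs_inactive(), instead of calling sa_update() directly: */
taskq_dispatch(zfs_atime_taskq, zfs_atime_update_task, zp, TQ_NOSLEEP);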

Your second observation about PF_MEMALLOC somehow getting cleared in txg_sync_thread() is troubling for two reasons. First off, this should be impossible since __alloc_pages_slowpath() checks for this pretty early on and aborts reclaim if it's set. Perhaps there is another call path where it can be cleared?

There are a few places in the kernel where PF_MEMALLOC could get cleared. It needs some more digging, although I'm not convinced that this particular phenomenon is related to the issue at hand.

And secondly, because prior to this fix it was quite easy to deadlock the txg_sync_thread. There's every indication that it is actually working as designed, are you absolutely sure you got this check correct?

Well, I don't know how I could have got it wrong, since I copy-pasted the code inside the main loop of txg_sync_thread, after the PF_MEMALLOC flag is set. The printk fired every five seconds, which matches the period of the txg sync. You can easily check it yourself if you want to confirm.

behlendorf added a commit to behlendorf/zfs that referenced this issue Nov 1, 2011
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
zpl_putpage() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#287
@behlendorf
Copy link
Contributor Author

It seems pretty clear to me. Let me summarize my findings:

  • balance_pgdat(), part of the asynchronous reclaim path in kswapd, uses current->reclaim_state.
  • The direct (synchronous) reclaim path resets current->reclaim_state. More precisely, __alloc_pages_direct_reclaim erases it with its own structure.

Conclusion: if you trigger a direct reclaim while you're already reclaiming (asynchronously) in kswapd, then the reclaim_state used by balance_pgdat() gets corrupted by __alloc_pages_direct_reclaim which replaces it with its own. After that, as one might expect, the system goes down fast.

Right, I completely agree. However, what you describe should be impossible: by design, kswapd is prevented from ever entering the synchronous reclaim path. Notice that when the kswapd thread is created, the PF_MEMALLOC bit is set in the task struct flags.

static int kswapd(void *p)
{
        ...
        tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

This flag is then checked early in __alloc_pages_slowpath, which is the only caller of __alloc_pages_direct_reclaim, and when it is set, synchronous reclaim is aborted. Regardless of what sort of memory allocations zfs does, the Linux VM itself should be preventing this. Now that's clearly not what's happening, and I think I see why, thanks to your legwork.

For this to occur something has to be mistakenly clearing the PF_MEMALLOC bit in the task struct. As it turns out a quick look at the spl and zfs code turns up two places where PF_MEMALLOC is set and then cleared unconditionally to avoid a deadlock. In both cases it's possible that PF_MEMALLOC was already set when entering the function, in which case we would have just mistakenly cleared it and armed the time bomb. Incidentally, this also explains the other odd symptom you reported of PF_MEMALLOC not being set in txg_sync_thread.

Can you please apply the following two patches and verify that they resolve the issue? If someone has a good test case for this, or just a workload/system which seems to hit this a lot, I'd love the extra testing before merging it to master.

spl patch: behlendorf/spl@940d49e
zfs patch: behlendorf/zfs@80c97dd
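
For reference, the essence of both patches is to make the clear conditional, along these lines (a sketch based on the commit descriptions; the variable name is mine):

/* Only clear PF_MEMALLOC on exit if it wasn't already set on entry. */
int memalloc_was_set = current->flags & PF_MEMALLOC;

current->flags |= PF_MEMALLOC;
/* ... work that must not recurse into direct reclaim ... */
if (!memalloc_was_set)
	current->flags &= ~PF_MEMALLOC;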

@behlendorf
Copy link
Contributor Author

As a personal note, it boggles my mind that __alloc_pages_direct_reclaim just resets current->reclaim_state without even an assertion or something to make sure that it's not doing something horribly wrong. Maybe we should make this a suggestion to the kernel developers: indeed, having an assertion there would have allowed us to debug this much faster, since we would have more useful stacktraces.

Oh, and I absolutely agree. A BUG_ON() would be a very good thing right there.

@dechamps
Copy link
Contributor

dechamps commented Nov 1, 2011

Right, I completely agree. However, what you describe should be impossible: by design, kswapd is prevented from ever entering the synchronous reclaim path. Notice that when the kswapd thread is created, the PF_MEMALLOC bit is set in the task struct flags.

This flag is then checked early in __alloc_pages_slowpath, which is the only caller of __alloc_pages_direct_reclaim, and when it is set, synchronous reclaim is aborted.

Oh, right… the kernel folks prevented this. Never underestimate the Power of the Kernel ;)

One thing worries me, though: isn't kmem_alloc_debug supposed to never fail? If the kernel refuses the allocation because it would trigger direct reclaim, what will SPL/ZFS do? We can't just wait for memory to become available, because, well, we're in kswapd; the kernel is expecting us to free memory…

For this to occur something has to be mistakenly clearing the PF_MEMALLOC bit in the task struct. As it turns out a quick look at the spl and zfs code turns up two places where PF_MEMALLOC is set and then cleared unconditionally to avoid a deadlock. In both cases it's possible that PF_MEMALLOC was already set when entering the function, in which case we would have just mistakenly cleared it and armed the time bomb. Incidentally, this also explains the other odd symptom you reported of PF_MEMALLOC not being set in txg_sync_thread.

You're absolutely right. Everything makes sense now :)

Can you please apply the following two patches and verify that they resolve the issue.

I'm a little busy right now, I'll try them as soon as I can. Sounds very promising though.

@behlendorf
Copy link
Contributor Author

One thing worries me, though: isn't kmem_alloc_debug supposed to never fail? If the kernel refuses the allocation because it would trigger direct reclaim, what will SPL/ZFS do? We can't just wait for memory to become available, because, well, we're in kswapd; the kernel is expecting us to free memory…

At the moment the kmem_alloc() implementation in the SPL always retries on failure... forever. I'm not proud of this, but we really have no choice. As you say, kmem_alloc() must succeed, period. The zfs code completely depends on this behavior and thus contains zero error handling for this. In contrast, failure of Linux's native kmalloc() is a downright common event and needs to be handled.

As for the exact case you're worried about, we should be OK. The kmem_alloc() call run under the kswapd thread may fail repeatedly and retry, but there are other threads running on the system which are allowed to perform direct reclaim. One of them will free the needed memory, and then the kmem_alloc() under kswapd will succeed and move on. This should also be a very unlikely event.
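
Conceptually, the retry behavior looks something like this (a minimal sketch, not the actual SPL source):

/* kmem_alloc() semantics require eventual success: keep retrying and
 * give other threads (which may perform direct reclaim) a chance to
 * free memory in the meantime. */
void *ptr;

do {
	ptr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
	if (ptr == NULL)
		schedule_timeout_interruptible(HZ / 100);
} while (ptr == NULL);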

I'm a little busy right now, I'll try them as soon as I can. Sounds very promising though.

No problem, I'll let it soak on a few test systems here but the more testing the better. This is actually a pretty safe change regardless.

@Rudd-O
Copy link
Contributor

Rudd-O commented Nov 2, 2011

One quick nontechnical note: It's simply exhilarating to see how you guys are cornering this bug with nail bats. I am privileged enough to have found and be part of this conversation, and I am enjoying every second of this.

If any of you guys stop by the bay area, at any point, drinks are on me.

@speed47
Copy link
Contributor

speed47 commented Nov 2, 2011

I wasn't hit by the bug in the last few days with atime disabled. Nice!
Just re-enabled it, starved my system of RAM a bit by copying stuff to /dev/shm, then ran:

find /zfs/ -type f -print0 | xargs -r0 touch

And bam, kernel NULL deref.
Now let's apply the patch above, and try that for a few hours in a loop + normal usage, with atime enabled.

@gunnarbeutner
Copy link
Contributor

Nice job. The PF_MEMALLOC bit is something I missed during my initial investigation. :)

There are at least half a dozen open tickets which depend on this fix.

@speed47
Copy link
Contributor

speed47 commented Nov 2, 2011

Things sound very good so far. I've applied the patch and added one printk in each of the two new if-PF_MEMALLOC-already-set blocks, and:

root@nas:/usr/src# time find /zfs/ -type f -print0 | xargs -r0 touch

real    2m10.195s
user    0m5.072s
sys     1m17.025s

root@nas:~# grep PF_MEMALLOC /var/log/system.log | cut -d" " -f5- | sort | uniq -c
   5543 PF_MEMALLOC squashing averted in kv_alloc !

We really had a lot of opportunities to create this deadlock before... Now I'm running the patch without the printks, and the above loop has been going for ~1 hour. No kernel bug so far. I'll keep it running every 10 minutes for several days, to be extra sure that nothing crashes horribly under different CPU/memory/etc. conditions (aka normal usage). But that would be strange given the changeset.

Kudos, guys.

@dechamps
Copy link
Contributor

dechamps commented Nov 2, 2011

I'm glad to announce that since I applied Brian's patches, my "PF_MEMALLOC not set in txg_sync_thread()" printk doesn't fire anymore.

I just re-enabled atime, let's see how it goes.

@Rudd-O
Copy link
Contributor

Rudd-O commented Nov 3, 2011

Imma test-drive your branch fix now. If you don't hear from me, either I am dead or all went well.

@cshei
Copy link

cshei commented Nov 3, 2011

I applied Brian's patch yesterday, re-enabled atime, and left a zfs scrub (on a 1.4 TB pool) and Windows VM running all night... no problems yet.

@Rudd-O
Copy link
Contributor

Rudd-O commented Nov 3, 2011

No problems yet here either.


@behlendorf
Copy link
Contributor Author

Since I'm hearing nothing but good things I'll be merging it in shortly.

behlendorf added a commit that referenced this issue Nov 3, 2011
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
zpl_putpage() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff

Signed-off-by: Brian Behlendorf <[email protected]>
Issue #287
@behlendorf
Copy link
Contributor Author

Fixed! Thank you everyone for the help with this. I'm closing the issue; the patches have been merged.

dajhorn referenced this issue in zfsonlinux/pkg-zfs Dec 14, 2011
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
kv_alloc() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff

Signed-off-by: Brian Behlendorf <[email protected]>
Closes ZFS issue #287
@verygreen
Copy link
Contributor

Sometimes they come back?

I hit what I believe is a similar issue at the same place today:

[67041.744126] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[67041.747037] IP: [<ffffffffa0b2b69b>] dbuf_sync_leaf+0x40b/0x5d0 [zfs]
[67041.747788] PGD 0 
[67041.748363] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[67041.749025] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) ext4 mbcache loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate jbd2 syscopyarea sysfillrect sysimgblt ttm ata_generic drm_kms_helper pata_acpi i2c_piix4 drm ata_piix virtio_balloon libata pcspkr serio_raw virtio_console i2c_core virtio_blk floppy nfsd ip_tables rpcsec_gss_krb5 [last unloaded: libcfs]
[67041.784257] CPU: 1 PID: 7952 Comm: dp_sync_taskq Tainted: P        W  OE  ------------   3.10.0-debug #2
[67041.785607] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[67041.786252] task: ffff88009fc0aa00 ti: ffff8802f5eb0000 task.ti: ffff8802f5eb0000
[67041.795527] RIP: 0010:[<ffffffffa0b2b69b>]  [<ffffffffa0b2b69b>] dbuf_sync_leaf+0x40b/0x5d0 [zfs]
[67041.797529] RSP: 0018:ffff8802f5eb3b28  EFLAGS: 00010207
[67041.798731] RAX: 0000000000000001 RBX: ffff88021b625dc0 RCX: ffff88008e441400
[67041.800393] RDX: 0000000000000000 RSI: ffff880215932c00 RDI: ffff880215932c00
[67041.801088] RBP: ffff8802f5eb3b70 R08: 0000000000000000 R09: 0000000000000000
[67041.801985] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801eb000e00
[67041.803283] R13: ffff88021b625e18 R14: ffff88029a9c2c90 R15: ffff88029a9c2a60
[67041.803973] FS:  0000000000000000(0000) GS:ffff88033e420000(0000) knlGS:0000000000000000
[67041.805197] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[67041.806058] CR2: 0000000000000028 CR3: 0000000001c0e000 CR4: 00000000000006e0
[67041.806809] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[67041.807525] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[67041.808195] Stack:
[67041.808770]  ffff88029dc9ee00 ffff88029dc9eec0 0000000000000650 ffff88028adf9480
[67041.810073]  ffff8801eb000e00 ffff88029a9c2c80 0000000000000001 ffff88029a9c2c90
[67041.811317]  ffff88028adf9480 ffff8802f5eb3ba8 ffffffffa0b2bcdd ffff88028adf9480
[67041.812929] Call Trace:
[67041.813571]  [<ffffffffa0b2bcdd>] dbuf_sync_list+0xbd/0xd0 [zfs]
[67041.814252]  [<ffffffffa0b4f74b>] dnode_sync+0x56b/0xf60 [zfs]
[67041.814925]  [<ffffffff817063d7>] ? _raw_spin_unlock+0x27/0x40
[67041.815609]  [<ffffffffa0b38438>] sync_dnodes_task+0x88/0xc0 [zfs]
[67041.816271]  [<ffffffffa09b1379>] taskq_thread+0x2b9/0x500 [spl]
[67041.816992]  [<ffffffff810b7cc0>] ? wake_up_state+0x20/0x20
[67041.817700]  [<ffffffffa09b10c0>] ? taskq_cancel_id+0x150/0x150 [spl]
[67041.818474]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[67041.819106]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[67041.819788]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[67041.820426]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[67041.821147] Code: 4c 89 ef e8 88 d1 fe ff 48 8b 93 f0 01 00 00 48 c7 83 e8 01 00 00 00 00 00 00 4c 39 e2 75 0c e9 92 01 00 00 0f 1f 40 00 48 89 c2 <48> 8b 42 28 4c 39 e0 75 f4 48 83 c2 28 49 8b 44 24 28 48 89 02 
[67041.824756] RIP  [<ffffffffa0b2b69b>] dbuf_sync_leaf+0x40b/0x5d0 [zfs]

No allocation failures or the like.

Now, this is running 0.7.1 (I will update to the latest soon), but was anything like that fixed recently? I have a crashdump if needed.

@behlendorf
Copy link
Contributor Author

@verygreen we haven't fixed anything like this recently, although we have occasionally been able to reproduce something which might be related using ztest, and we're looking into that.

fuhrmannb pushed a commit to fuhrmannb/cstor that referenced this issue Nov 3, 2020
mmaybee pushed a commit to mmaybee/openzfs that referenced this issue Apr 6, 2022
`cargo test` runs both the explicit tests and any code in doc comments
that is not explicitly ignored.  There are some failures in
lazy_static_ptr.rs and vec_ext.rs, which this commit fixes by making the
example code actually runnable.