userquota_updates_task NULL deref #7147
As I said in #7059:
If you hit this issue again after that, let me know and I'll investigate further.
I suppose you've realized by now that I'm the same guy in all of the threads 😄. The situation is as follows. I have two pools, call them
I've first seen the
This morning I started a scrub on
Now I'm no longer able to boot from
Additionally, I got the error in this issue at boot:
Both pools are online right now,
EDIT: As mentioned below, I tried to start a scrub on the backup pool, then one on
The plot thickens... I started a scrub on the back-up pool, then tried to start one on
while
Since the root fs on
While you have the system in this state, can you run
If your kernel has crashed, there is no need to do this; the crash is almost certainly causing the problem.
Some things are still working, but it's mostly crashed, I suppose. I was able to export the back-up pool.
OK. Sorry to ask this, but humor me for a second. You have opened several issues, and at this point I'm having a hard time keeping it all straight. Could you quickly go through the timeline of what you have, what you did, and when the issues you've encountered happened? I'm going to guess that they are all probably related, so having the full timeline for all of them will probably help a lot.
That would be #7147 (comment) above, I suppose.
Just so I have it right, here is my summary of that (along with the questions I still have). Can you confirm this and answer the questions in parentheses as best you can? I'm also not sure at what point you started encountering the
I saw the error message, but there was no apparent crash, and everything seemed to work afterwards. I didn't reboot. I don't know for which pool it was.
I started a scrub on
Due to a
Right. The
Before the first crash, sending
No crash, same
Alright, I think I have a grasp on everything now and I'll try to take a look today.
My apologies. I have run out of time for today and will need to look into this more on Monday.
@lnicola Would you mind running:
This should help me figure out where the initial crash happened.
Sorry, I updated my kernel and ZFS version in the meantime,
EDIT: The file system in question got rolled back somehow, so I'm no longer having any issues. If you feel like closing this, that's fine.
@lnicola Would you mind trying this again with the most recent code from master? We recently fixed a related issue, and I think the problem might be solved.
I'll try it on the next distro package update, but the only issue I still have is the slow
OK. That problem should mostly be alleviated by #7197, so hopefully all your problems will be fixed soon.
@lnicola Any update here?
As I mentioned before, the file system or pool got rolled back at the time. I lost about 9 hours of changes, and I didn't encounter any issues afterwards, even with that same ZFS version. So I don't know if it's fixed, but I no longer have a way to reproduce these issues.
I just got a new one, in
The application (a
EDIT: Unlike the last time, the pool came back fine (I think) after the reboot.
EDIT 2: Happened again during a
Call traces of processes in
@lnicola That is expected. Once the GPF happens, that thread will stop running, causing any threads that rely on it to also become stuck. Unfortunately, we need to find a way to reproduce the issue at least somewhat consistently if we are going to have a chance to fix it.
Well, it seems I can still reproduce this, so tell me if there's anything I can do...
@lnicola Can you provide steps that I can use to reproduce this? As of yet I have not been able to do so.
The PANIC actually shows up after the list_del corruption, though.
Here's the log.
It should be large dnodes; I'll check once I reboot. Looking at that function, I'm not sure how that VERIFY fails, though: it checks that the ref count is not zero, then adds 1 and checks that it's greater than 1, and that fails.
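As a concrete illustration of that check (a minimal standalone sketch, not the actual ZFS refcount code, which uses its own VERIFY macros and locking), the pattern looks roughly like this:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch only: assert that the count is nonzero before the
 * increment and strictly greater than one afterwards. The second
 * assertion corresponds to the failing VERIFY described above.
 */
static uint64_t
sketch_refcount_add(uint64_t *count)
{
	assert(*count != 0);	/* a reference must already be held */
	*count += 1;
	assert(*count > 1);	/* the check reported to fail in this issue */
	return (*count);
}
```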
Yes, large_dnodes is active.
In dnode_move, don't we have to worry about dn_dirty_link in addition to dn_link?
@nivedita76 If you are able to reproduce the problem consistently, would it be possible to provide a stack trace of the
How should I do that? The last log I attached was what got dumped into dmesg.
Sorry. I missed it (needed to scroll down a bit farther). What you have provided is fine.
Might have an idea of what's causing this... I'll get back soon if I figure anything out.
I just had a list_del corruption even after configuring zfs_multilist_num_sublists to 1. Unfortunately, this time it didn't save anything to the log before crashing.
@nivedita76 You can try:
modprobe netconsole
dmesg -n 8
cd /sys/kernel/config/netconsole
mkdir -p target1
cd target1
echo XXXX > local_ip
echo XXXX > remote_ip
echo enXXXX > dev_name
echo XX:XX:XX:XX:XX:XX > remote_mac
echo 1 > enabled
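On the receiving end, any UDP listener on the machine whose address was written to remote_ip should capture the stream; netconsole sends to UDP port 6666 by default unless remote_port is set, so a plain netcat or socat listener bound to that port is usually enough.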
@nivedita76 Try applying this patch:
This needs some work before I merge it, but in our testing it seemed to fix the problem.
Currently, dnode_check_slots_free() works by checking dn->dn_type in the dnode to determine if the dnode is reclaimable. However, there is a small window of time between dnode_free_sync() in the first call to dsl_dataset_sync() and when the useraccounting code is run when the type is set DMU_OT_NONE, but the dnode is not yet evictable. This patch adds a check for whether dn_dirty_link is active to determine if we are in this state. This patch also corrects several instances when dn_dirty_link was treated as a list_node_t when it is technically a multilist_node_t.

Fixes: openzfs#7147
Signed-off-by: Tom Caputi <[email protected]>
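As a rough illustration of the race the patch message describes, here is a hedged, self-contained sketch (simplified, with invented field names, not the actual OpenZFS code): a dnode whose type has already been set to DMU_OT_NONE must still be treated as busy while any of its per-txg dirty links are active.

```c
#include <stdbool.h>

#define TXG_SIZE	4	/* one dirty list per in-flight txg, simplified */
#define DMU_OT_NONE	0

/* Simplified stand-in for the real dnode; field names are illustrative. */
typedef struct sketch_dnode {
	int	dn_type;			/* DMU_OT_NONE once freed */
	bool	dn_dirty_linked[TXG_SIZE];	/* "is dn_dirty_link active?" */
} sketch_dnode_t;

/*
 * Sketch of the tightened reclaimability check: dn_type alone is not
 * enough, because the type is cleared before the user accounting pass
 * runs, so also require that no dirty link is still active.
 */
static bool
sketch_dnode_slot_is_free(const sketch_dnode_t *dn)
{
	if (dn->dn_type != DMU_OT_NONE)
		return (false);

	for (int i = 0; i < TXG_SIZE; i++) {
		if (dn->dn_dirty_linked[i])
			return (false);	/* still on a dirty list */
	}
	return (true);
}
```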
@nivedita76 If you get a chance, try the patch from #7388 and see if the problem is fixed.
Got an oops.
[ 62.868131] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Currently, dnode_check_slots_free() works by checking dn->dn_type in the dnode to determine if the dnode is reclaimable. However, there is a small window of time between dnode_free_sync() in the first call to dsl_dataset_sync() and when the useraccounting code is run when the type is set DMU_OT_NONE, but the dnode is not yet evictable, leading to crashes. This patch adds the ability for dnodes to track which txg they were last dirtied in and adds a check for this before performing the reclaim. This patch also corrects several instances when dn_dirty_link was treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #7147
Closes #7388
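For comparison with the earlier dn_dirty_link check, here is a hedged sketch of the approach the final commit message describes: remember the txg in which the dnode was last dirtied and refuse to treat the slot as free until that txg has fully synced. Names are invented for illustration and do not match the OpenZFS source.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMU_OT_NONE	0

typedef struct sketch_dnode {
	int		dn_type;	/* DMU_OT_NONE once freed */
	uint64_t	dn_dirty_txg;	/* txg in which the dnode was last dirtied */
} sketch_dnode_t;

/* Record the dirtying txg every time the dnode is dirtied. */
static void
sketch_dnode_set_dirty(sketch_dnode_t *dn, uint64_t txg)
{
	dn->dn_dirty_txg = txg;
}

/*
 * The slot is reclaimable only if the dnode is free *and* the txg that
 * last dirtied it has already finished syncing, which closes the window
 * between dnode_free_sync() and the user accounting pass.
 */
static bool
sketch_dnode_slot_is_free(const sketch_dnode_t *dn, uint64_t last_synced_txg)
{
	return (dn->dn_type == DMU_OT_NONE &&
	    dn->dn_dirty_txg <= last_synced_txg);
}
```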