master panics #201

Closed
rottegift opened this issue Jun 23, 2014 · 139 comments
@rottegift
Contributor

zfs mount -a did this:

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Mon Jun 23 12:48:21 2014
panic(cpu 1 caller 0xffffff80272dbf5e): Kernel trap at 0xffffff7fa8f0d9d1, type 14=page fault, registers:
CR0: 0x0000000080010033, CR2: 0x00000000008a0000, CR3: 0x0000000014587149, CR4: 0x00000000001606e0
RAX: 0x00000000008a0000, RBX: 0xffffff8218ba3f10, RCX: 0x0000000000000000, RDX: 0xffffff82186b98b8
RSP: 0xffffff8218ba3840, RBP: 0xffffff8218ba3840, RSI: 0xffffff82186b98b8, RDI: 0xffffff82186b98b8
R8:  0xffffff8218ba38c8, R9:  0xffffff8218ba3890, R10: 0xffffff82151a51d0, R11: 0xffffffffffffff00
R12: 0x0000000000100003, R13: 0xffffff8060f9e4b0, R14: 0xffffff8218ba3e70, R15: 0x0000000000000048
RFL: 0x0000000000010286, RIP: 0xffffff7fa8f0d9d1, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x00000000008a0000, Error code: 0x0000000000000002, Fault CPU: 0x1

Backtrace (CPU 1), Frame : Return Address
0xffffff8218ba34d0 : 0xffffff8027222fa9 mach_kernel : _panic + 0xc9
0xffffff8218ba3550 : 0xffffff80272dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff8218ba3720 : 0xffffff80272f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff8218ba3740 : 0xffffff7fa8f0d9d1 net.lundman.spl : _list_insert_tail + 0x21
0xffffff8218ba3840 : 0xffffff7fa8f66e8a net.lundman.zfs : _dsl_dir_tempreserve_impl + 0x38a
0xffffff8218ba3920 : 0xffffff7fa8f66abd net.lundman.zfs : _dsl_dir_tempreserve_space + 0x17d
0xffffff8218ba39a0 : 0xffffff7fa8f4d027 net.lundman.zfs : _dmu_tx_try_assign + 0x447
0xffffff8218ba3a50 : 0xffffff7fa8f4cb6b net.lundman.zfs : _dmu_tx_assign + 0x3b
0xffffff8218ba3a80 : 0xffffff7fa8ff5d8b net.lundman.zfs : _zfs_write + 0x9eb
0xffffff8218ba3d00 : 0xffffff7fa9004ffe net.lundman.zfs : _zfs_vnop_write + 0xae
0xffffff8218ba3d60 : 0xffffff80273fde51 mach_kernel : _VNOP_WRITE + 0xe1
0xffffff8218ba3de0 : 0xffffff80273f3be3 mach_kernel : _utf8_normalizestr + 0x703
0xffffff8218ba3e50 : 0xffffff80275f22f1 mach_kernel : _write_nocancel + 0x1b1
0xffffff8218ba3ef0 : 0xffffff80275f21dd mach_kernel : _write_nocancel + 0x9d
0xffffff8218ba3f50 : 0xffffff8027640653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff8218ba3fb0 : 0xffffff80272f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[626E72AC-DE97-3631-ADC2-0849D468CF64]@0xffffff7fa8f0b000->0xffffff7fa8f1bfff
         net.lundman.zfs(1.0)[AEAAAB93-CD2F-3028-8BFD-D0F399FAA6BE]@0xffffff7fa8f1c000->0xffffff7fa9128fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fa77f3000
            dependency: net.lundman.spl(1.0.0)[626E72AC-DE97-3631-ADC2-0849D468CF64]@0xffffff7fa8f0b000

BSD process name corresponding to current thread: squid
Boot args: -v keepsyms=y darkwake=0 -s
@rottegift
Contributor Author

And previously exporting did this:

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Mon Jun 23 12:46:18 2014
panic(cpu 4 caller 0xffffff80086dbf5e): Kernel trap at 0xffffff7f8a42f624, type 13=general protection, registers:
CR0: 0x0000000080010033, CR2: 0x00007fff5ca8c828, CR3: 0x00000001070e805b, CR4: 0x00000000001606e0
RAX: 0x74697265686e6924, RBX: 0xffffff82134bbd58, RCX: 0xffffff7f8a4b48c0, RDX: 0x0000000000000001
RSP: 0xffffff82134bb6b0, RBP: 0xffffff82134bb6e0, RSI: 0xffffff82134bb878, RDI: 0xffffff826bfd7b28
R8:  0x0000000000000074, R9:  0x0000000000000220, R10: 0xffffff82134bb878, R11: 0xffffff82134bb500
R12: 0x0000000012000530, R13: 0xffffff803aa7a000, R14: 0x0000000012000530, R15: 0xffffff802d9a8a80
RFL: 0x0000000000010206, RIP: 0xffffff7f8a42f624, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x00007fff5ca8c828, Error code: 0x0000000000000000, Fault CPU: 0x4

Backtrace (CPU 4), Frame : Return Address
0xffffff81e8875df0 : 0xffffff8008622fa9 mach_kernel : _panic + 0xc9
0xffffff81e8875e70 : 0xffffff80086dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff81e8876040 : 0xffffff80086f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff81e8876060 : 0xffffff7f8a42f624 net.lundman.zfs : _nvlist_exists + 0x54
0xffffff82134bb6e0 : 0xffffff7f8a36e20b net.lundman.zfs : _dsl_prop_get_all_impl + 0x36b
0xffffff82134bb9e0 : 0xffffff7f8a36dacc net.lundman.zfs : _dsl_prop_get_all_ds + 0x19c
0xffffff82134bbb40 : 0xffffff7f8a36d925 net.lundman.zfs : _dsl_prop_get_all + 0x25
0xffffff82134bbb60 : 0xffffff7f8a3e4e4d net.lundman.zfs : _zfs_ioc_objset_stats_impl + 0x4d
0xffffff82134bbba0 : 0xffffff7f8a3e1c7b net.lundman.zfs : _zfs_ioc_snapshot_list_next + 0x1ab
0xffffff82134bbc20 : 0xffffff7f8a3dd7a4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff82134bbcf0 : 0xffffff800880d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff82134bbd40 : 0xffffff80087fe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff82134bbdc0 : 0xffffff80087f3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff82134bbe10 : 0xffffff80089c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff82134bbe40 : 0xffffff80089f2c66 mach_kernel : _ioctl + 0x466
0xffffff82134bbf50 : 0xffffff8008a40653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff82134bbfb0 : 0xffffff80086f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[AEAAAB93-CD2F-3028-8BFD-D0F399FAA6BE]@0xffffff7f8a31c000->0xffffff7f8a528fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f88bf3000
            dependency: net.lundman.spl(1.0.0)[626E72AC-DE97-3631-ADC2-0849D468CF64]@0xffffff7f8a30b000

BSD process name corresponding to current thread: zpool
Boot args: -v keepsyms=y darkwake=0

@lundman
Contributor

lundman commented Jun 23, 2014

The first panic, we have:

    tr_list = kmem_alloc(sizeof (list_t), KM_PUSHPAGE);
    list_create(tr_list, sizeof (struct tempreserve),
        offsetof(struct tempreserve, tr_node));

which allocates with PUSHPAGE, but even if we did return NULL here, we would panic in list_create.

Then we call:

dsl_dir_tempreserve_impl(dsl_dir_t *dd, uint64_t asize, boolean_t netfree,
    boolean_t ignorequota, boolean_t checkrefquota, list_t *tr_list,
    dmu_tx_t *tx, boolean_t first)

which does:

    tr = kmem_zalloc(sizeof (struct tempreserve), KM_PUSHPAGE);
    tr->tr_ds = dd;
    tr->tr_size = asize;
    list_insert_tail(tr_list, tr);

again, PUSHPAGE, and even if it returned NULL, the assignment through tr would panic, not the call to list_insert_tail. This would then indicate memory corruption, since both tr and tr_list here are most likely good. There are no threading issues, since the list is allocated for this thread and only used here once.

We have seen a few panics in the vnop_write path though; could there be an issue there?
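For reference, the list insert that faulted is the illumos-style embedded-list code; below is a paraphrased sketch (not the exact SPL source) of what list_insert_tail() does. Both the list_t and the node embedded in the object are dereferenced immediately, so a bogus tr_list or tr pointer would fault right here, which is consistent with the corruption theory above.

    /*
     * Paraphrased sketch of the illumos-style embedded list, not the exact
     * SPL source.  list_offset is the offsetof() value passed to list_create().
     */
    typedef struct list_node {
        struct list_node *list_next;
        struct list_node *list_prev;
    } list_node_t;

    typedef struct list {
        size_t      list_size;    /* size of the objects on the list */
        size_t      list_offset;  /* offset of the embedded list_node_t */
        list_node_t list_head;    /* sentinel node; the list is circular */
    } list_t;

    #define list_d2l(a, obj) \
        ((list_node_t *)(((char *)(obj)) + (a)->list_offset))

    void
    list_insert_tail(list_t *list, void *object)
    {
        list_node_t *lnew = list_d2l(list, object);  /* node inside *object */
        list_node_t *head = &list->list_head;

        /* Faults here if either 'list' or 'object' points at garbage. */
        lnew->list_prev = head->list_prev;
        lnew->list_next = head;
        head->list_prev->list_next = lnew;
        head->list_prev = lnew;
    }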

@lundman
Contributor

lundman commented Jun 24, 2014

Second panic

    nvlist_t *nv;
        (error = dsl_prop_get_all(os, &nv)) == 0) {
dsl_prop_get_all_ds(dsl_dataset_t *ds, nvlist_t **nvp,
    dsl_prop_getflags_t flags)
{
...
    VERIFY(nvlist_alloc(nvp, NV_UNIQUE_NAME, KM_SLEEP) == 0);
...
        err = dsl_prop_get_all_impl(mos, ds->ds_phys->ds_props_obj,
            setpoint, flags, *nvp);
dsl_prop_get_all_impl(objset_t *mos, uint64_t propobj,
    const char *setpoint, dsl_prop_getflags_t flags, nvlist_t *nv)
{
...
        /* Skip properties already defined. */
        if (nvlist_exists(nv, propname))
            continue;

So, it definitely initialises nvp (or VERIFY panics). But we have no way to know the status of propname. It is not used prior in a way that would definitely panic.
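For reference, nvlist_exists() is roughly the following (paraphrased from the illumos nvpair code, not the o3x tree verbatim). It dereferences the nvlist's private pointer and then every nvpair name in the chain, so a clobbered nvl_priv, a corrupted nvpair chain, or a bad propname string would all take a trap inside this loop.

    /*
     * Paraphrased sketch of nvlist_exists(); nvpriv_t, i_nvp_t and NVP_NAME()
     * come from the nvpair implementation headers.
     */
    boolean_t
    nvlist_exists(nvlist_t *nvl, const char *name)
    {
        nvpriv_t *priv;
        i_nvp_t *curr;

        if (name == NULL || nvl == NULL ||
            (priv = (nvpriv_t *)(uintptr_t)nvl->nvl_priv) == NULL)
            return (B_FALSE);

        /* Walk the pair chain, comparing names. */
        for (curr = priv->nvp_list; curr != NULL; curr = curr->nvi_next) {
            nvpair_t *nvp = &curr->nvi_nvp;

            if (strcmp(name, NVP_NAME(nvp)) == 0)
                return (B_TRUE);
        }

        return (B_FALSE);
    }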

@lundman
Contributor

lundman commented Jun 26, 2014

Actually, do you think I could convince you to run the kexts from today's DMG?

http://lundman.net/ftp/osx.zfs/osx.zfs-signed-20140627.dmg

Nothing special in it, but it rules out the compile environment and related bits.

@rottegift
Contributor Author

Well the most recent master just took out half of two mirrored vdevs in two pools and resilvering will take a lonnnnnnnnnnnnng time, so I'm a bit keen to avoid having another vdev taken out. Mere panics have never done that.

@rottegift
Contributor Author

(In fact the bigger shock was that a pool that never imports automatically (it's deliberately never in zpool.cache, ever) managed to get imported, and that is one of the pools resilvering.)

@lundman
Contributor

lundman commented Jun 27, 2014

Yeah, I think 95ff805 changed the kext so we don't load the cache file, but it also changed zfs.util to "list pools" + "import pools". You don't want the new zfs.util to run until we can make it better. Just the two kexts from the dmg is what I mean: no zed, no zfs.util.

But I can understand the hesitation. We will have to think about what to do about pools that users don't want imported.

@rottegift
Contributor Author

I can do that in the alternate boot environment I have on the same machine, but I'll wait to see how long the resilvering takes.

"We will have to think about what to do about pools that users dont want imported."

Please also think about not astonishing people who work with other zfs implementations, none of which import pools automatically unless they're in zpool.cache (e.g., think of the cachefile pool property).

@lundman
Contributor

lundman commented Jun 27, 2014

Yes, that is the discussion. One, the cache file controls what imports on boot.

But also, on OSX, if you plug in a device with ZFS on it after boot, OSX users expect it to be imported and mounted. We have to follow this as well, even if it is not done on other platforms (by default).

@rottegift
Contributor Author

Ok, I now have two ways I think I can provoke two different panics (one a double panic that leaves nothing sensible in the crash dump, and one that seems to leave a usable report), so I'm now in an alternate boot environment that has never had zfs installed on it.

Download and mount today's dmg, run load_zfs.sh, zpool import: looks OK. I'll carry on with tests and update here later.

@rottegift
Contributor Author

PS: If you're going to depart wholeheartedly from having a zpool.cache, that's fine, but the documentation job will be important. If you're going to do a small departure, say, automatically import ONLY (!!!) single-device pools, that's fine too. However, I personally would prefer to be as close to openzfs as possible.

Another option would be an o3x-specific pool feature that would allow automatic import if active. If it's important that a pool with the feature active be imported on another zfs implementation, sell the idea to openzfs developers.

@rottegift
Contributor Author

Hmm... while not conclusive, I haven't gotten it to crash yet.

@rottegift
Contributor Author

Ah, but the pools' cache vdevs are ONLINE but are not actually being written to.

@lundman
Contributor

lundman commented Jun 27, 2014

Having the kext read the cachefile seems to be problematic on OSX, since the device names/orders change. Neither master nor vdev-iokit handles that renaming yet.

One thought is to take the cachefile reading out of the kext (no auto-import there) and let zfs.util handle it: just add some code to zfs.util to read the cachefile and import only the pools listed in it, along the lines of the sketch below.
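A minimal sketch of that idea, assuming zfs.util can link against the usual libzfs import helpers (zpool_find_import_cached() and zpool_import(), the same path zpool import -c uses); this is illustrative only, not the proposed patch:

    #include <libzfs.h>

    /* Sketch: import only the pools recorded in the cachefile. */
    static void
    import_cached_pools(const char *cachefile)
    {
        libzfs_handle_t *hdl = libzfs_init();
        nvlist_t *pools;
        nvpair_t *elem = NULL;

        if (hdl == NULL)
            return;

        /* One nvlist entry per pool found in the cachefile. */
        pools = zpool_find_import_cached(hdl, cachefile, NULL, 0);

        while (pools != NULL &&
            (elem = nvlist_next_nvpair(pools, elem)) != NULL) {
            nvlist_t *config;

            if (nvpair_value_nvlist(elem, &config) != 0)
                continue;

            /* Import under the pool's existing name, no altroot. */
            (void) zpool_import(hdl, config, NULL, NULL);
        }

        if (pools != NULL)
            nvlist_free(pools);
        libzfs_fini(hdl);
    }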

@rottegift
Contributor Author

YAY! YAY! A panic!

Anonymous UUID:       2A7EBDDF-80E2-8419-C1B5-5A9410E11B67

Fri Jun 27 02:04:33 2014
panic(cpu 2 caller 0xffffff802f6dbf5e): Kernel trap at 0xffffff7fb15fe300, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0x0000000000000170, CR3: 0x0000000031bb4000, CR4: 0x00000000001606e0
RAX: 0xffffff823651bc40, RBX: 0xffffff823039a6d8, RCX: 0x0000000000000108, RDX: 0x0000000000000108
RSP: 0xffffff82300b39c0, RBP: 0xffffff82300b39f0, RSI: 0x0000000000000001, RDI: 0x0000000000000000
R8:  0x00000000ffff0000, R9:  0x00000000ffff0000, R10: 0xffffff82300b3cc0, R11: 0xffffffffffffffff
R12: 0xffffff823039a718, R13: 0x000000000000a7ba, R14: 0xffffff823039a6f8, R15: 0xffffff8239e30538
RFL: 0x0000000000010202, RIP: 0xffffff7fb15fe300, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000000000170, Error code: 0x0000000000000000, Fault CPU: 0x2

Backtrace (CPU 2), Frame : Return Address
0xffffff82300b3650 : 0xffffff802f622fa9 mach_kernel : _panic + 0xc9
0xffffff82300b36d0 : 0xffffff802f6dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff82300b38a0 : 0xffffff802f6f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff82300b38c0 : 0xffffff7fb15fe300 net.lundman.zfs : _vdev_dtl_contains + 0x40
0xffffff82300b39f0 : 0xffffff7fb1606fb9 net.lundman.zfs : _vdev_mirror_io_done + 0x329
0xffffff82300b3a90 : 0xffffff7fb16715a9 net.lundman.zfs : _zio_vdev_io_done + 0x1a9
0xffffff82300b3ad0 : 0xffffff7fb166df4a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff82300b3b10 : 0xffffff7fb1672b92 net.lundman.zfs : _zio_done + 0x11d2
0xffffff82300b3c30 : 0xffffff7fb166df4a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff82300b3c70 : 0xffffff7fb1672b92 net.lundman.zfs : _zio_done + 0x11d2
0xffffff82300b3d90 : 0xffffff7fb166df4a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff82300b3dd0 : 0xffffff7fb1672b92 net.lundman.zfs : _zio_done + 0x11d2
0xffffff82300b3ef0 : 0xffffff7fb166df4a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff82300b3f30 : 0xffffff7fb166de15 net.lundman.zfs : _zio_execute + 0x15
0xffffff82300b3f50 : 0xffffff7fb1562217 net.lundman.spl : _taskq_thread + 0xc7
0xffffff82300b3fb0 : 0xffffff802f6d7127 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[43937DBE-7CC1-36E2-BFF0-AA5430ABBC60]@0xffffff7fb155e000->0xffffff7fb156efff
         net.lundman.zfs(1.0)[170185FD-89A4-3CB0-85E1-E3807CCEDBEC]@0xffffff7fb156f000->0xffffff7fb177bfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fafe9b000
            dependency: net.lundman.spl(1.0.0)[43937DBE-7CC1-36E2-BFF0-AA5430ABBC60]@0xffffff7fb155e000

BSD process name corresponding to current thread: kernel_task
Boot args: -v keepsyms=y

@rottegift
Contributor Author

vdev-iokit has a big operational advantage at the moment in that shortly after boot it reliably imports pools with many vdevs and leaves them all (including the cache vdevs) working correctly, every time. master practically always has issues shortly after boot: FAULTED pools, cache vdevs not working, etc.

@rottegift
Contributor Author

At the time of the crash above, two pools were resilvering, and one of those was the source of some zfs send traffic. A third pool was being beaten up by diskutil verifyvolume running concurrently on two zvols. And a fourth was doing minor I/O. IIRC I was doing about 3k IOPS.

@lundman
Contributor

lundman commented Jun 27, 2014

So you are saying that vdev_iokit handles the device renaming and autoimport on boot just fine?

@rottegift
Contributor Author

Yes, that is what I am saying (about vdev-iokit). Given the number of times I've restarted lately, it is a huge advantage to vdev-iokit.

Oh, I'm back and have loaded the kexts again (no cachefile exists, to be clear). At import time the cache vdevs are again ONLINE but are not used. Ordinarily (and always in vdev-iokit) there would be a couple of megabytes written to them as fast as you could run zpool iostat -v upon import.

@rottegift
Contributor Author

Yes! Another panic.

This was exporting CLATM after its resilver had finished. Trinity was naturally still resilvering and was also the source of a zfs send.

Anonymous UUID:       2A7EBDDF-80E2-8419-C1B5-5A9410E11B67

Fri Jun 27 02:36:37 2014
panic(cpu 4 caller 0xffffff800a2dbf5e): Kernel trap at 0xffffff7f8c282614, type 13=general protection, registers:
CR0: 0x000000008001003b, CR2: 0x00007fbd79d3a000, CR3: 0x00000000339eb02c, CR4: 0x00000000001606e0
RAX: 0x74697265686e6924, RBX: 0xffffff8241d2bd58, RCX: 0xffffff7f8c3078c0, RDX: 0x0000000000000001
RSP: 0xffffff8241d2b6b0, RBP: 0xffffff8241d2b6e0, RSI: 0xffffff8241d2b878, RDI: 0xffffff81fd81cf88
R8:  0x0000000000000074, R9:  0x0000000000000220, R10: 0xffffff8241d2b878, R11: 0xffffff8241d2b500
R12: 0x000000002200011d, R13: 0xffffff803aba44b0, R14: 0x000000002200011d, R15: 0xffffff802beb3540
RFL: 0x0000000000010206, RIP: 0xffffff7f8c282614, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x00007fbd79d3a000, Error code: 0x0000000000000000, Fault CPU: 0x4

Backtrace (CPU 4), Frame : Return Address
0xffffff81ea80ddf0 : 0xffffff800a222fa9 mach_kernel : _panic + 0xc9
0xffffff81ea80de70 : 0xffffff800a2dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff81ea80e040 : 0xffffff800a2f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff81ea80e060 : 0xffffff7f8c282614 net.lundman.zfs : _nvlist_exists + 0x54
0xffffff8241d2b6e0 : 0xffffff7f8c1c110b net.lundman.zfs : _dsl_prop_get_all_impl + 0x36b
0xffffff8241d2b9e0 : 0xffffff7f8c1c09cc net.lundman.zfs : _dsl_prop_get_all_ds + 0x19c
0xffffff8241d2bb40 : 0xffffff7f8c1c0825 net.lundman.zfs : _dsl_prop_get_all + 0x25
0xffffff8241d2bb60 : 0xffffff7f8c237e2d net.lundman.zfs : _zfs_ioc_objset_stats_impl + 0x4d
0xffffff8241d2bba0 : 0xffffff7f8c234c5b net.lundman.zfs : _zfs_ioc_snapshot_list_next + 0x1ab
0xffffff8241d2bc20 : 0xffffff7f8c230784 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff8241d2bcf0 : 0xffffff800a40d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff8241d2bd40 : 0xffffff800a3fe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff8241d2bdc0 : 0xffffff800a3f3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff8241d2be10 : 0xffffff800a5c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff8241d2be40 : 0xffffff800a5f2c66 mach_kernel : _ioctl + 0x466
0xffffff8241d2bf50 : 0xffffff800a640653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff8241d2bfb0 : 0xffffff800a2f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[170185FD-89A4-3CB0-85E1-E3807CCEDBEC]@0xffffff7f8c16f000->0xffffff7f8c37bfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f8aa9b000
            dependency: net.lundman.spl(1.0.0)[43937DBE-7CC1-36E2-BFF0-AA5430ABBC60]@0xffffff7f8c15e000

BSD process name corresponding to current thread: real_zfs
Boot args: -v keepsyms=y

@rottegift
Contributor Author

This time, "zpool import -N Trinity" (after loading the ftp kexts as above) results in the cache vdevs not only being ONLINE but also seeing traffic:

cache                          -      -      -      -      -      -
  disk32s4                  942K  17.0G      0      0  1.19K  8.48K
  disk31s4                 2.87M  17.0G      0      0  1.19K  26.0K

@lundman
Contributor

lundman commented Jun 27, 2014

We are certainly seeing zfs_ioc_snapshot_list_next frequently enough that there might just be an issue there.

Do you have a second machine/VM that you can run lldb from on panic? Or should we see if Darwin can still send cores to tftp...

@rottegift
Contributor Author

If 10.8 lldb can talk to a 10.9 kernel, then yes; otherwise I could do the tftp trick relatively easily (is it a boot argument?) or cough up another 10.9 box, but probably none of the above until some hours from now.

@lundman
Contributor

lundman commented Jun 27, 2014

It can be picky about the lldb version, I'm not sure. As long as the kernel debug dmg is exactly the same it might work. Or I can connect from here, but that is awkward with timezones and all that.

@rottegift
Contributor Author

Tricky. We'll have to come back to it when I am more functional (timezones...)

The job that sends snapshots from 52 datasets to a mother machine just terminated correctly. The main difference is that it was running while the system had very little load (just a resilver, really). Doing the same job while the machine is under heavy load (multiple resilvers/scrubs, heavy activity on zvols) tends to coincide with a panic.

Although it cannot be a factor in this boot environment, I do run zfs snapshots on several datasets about once a minute, and the last syslog entry before several BOOT_TIME entries has often been from zfs-auto-snapshot.

There are a couple of other tasks (that are independent of any zfs send/receive or snapshot activity) that trigger a panic as noted in the long vdev-iokit thread with @evansus -- I will try to get around to doing that with the ftp kexts tomorrow. He's also been trying to reproduce them, I think.

I don't know whether the panics have a single common source or whether heavy load exposes multiple problems. I'm (now) pretty confident it's not hardware or build environment related, though.

@lundman
Contributor

lundman commented Jun 27, 2014

Indeed, if you happen to get a panic while we are both online (and the pool can be down for ~10 mins), log on to IRC and maybe we can sort something out. If you are on a LAN, we can create an account for you on o3x.org for an ssh tunnel.

Meanwhile, I am wondering if there is some way that zio structs can be corrupted and/or used after free and/or exhausted, but they use straight kmem_cache_alloc, which is on top of bmem.

It is true the taskq code in SPL is almost a straight paste job from MacZFS, and should probably be inspected for heavy-load cases. But since we used to spawn ~270 threads for each pool, perhaps it is not related.

@lundman
Contributor

lundman commented Jun 27, 2014

Do you remember if you still had panics when NOT doing one of those tasks? Like no zfs send, or no zvol use? Hmm, I guess I would only suspect zvols in that list. Do panics happen with no zvols in use?

@rottegift
Contributor Author

Well I seem to be able to provoke a panic in a relatively short timeframe by deliberately firing up real traffic-generating jobs concurrently -- not exactly on demand and not an easy-to-reduce test case, but maybe easier than deciding on what IRC client to use and where I'm leaking my public IP addresses. :-)

In answer to your question above, yes, I have had panics when not doing zfs send, and when there have been no pools with zvols imported at all. Beating up zvols makes panics happen more quickly, but it's not necessary. The last panic ( #201 (comment) ) was zvol-free.

@rottegift
Contributor Author

Another trigger, for example, is high inbound P2P load (the relevant thing being many writing threads rather than one fast source) onto a slow-to-write pool. No zvols or zfs send involved.

Opening hundreds of Safari and Chrome windows and tabs has also been an apparent trigger, mostly because of the cache activity of the browsers vs a squid cache that is on the same pool. Again that's lots of threads/processes writing a lot.

@rottegift
Contributor Author

Again, I was AFK for a bit and returned to this.

There was a big resilver going on, but nothing else out of the ordinary (the load was very light, for instance).

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Fri Jul  4 16:26:28 2014
panic(cpu 1 caller 0xffffff802aadc24e): Kernel trap at 0xffffff7fab127dd4, type 13=general protection, registers:
CR0: 0x000000008001003b, CR2: 0x00007fff19617200, CR3: 0x000000002d005000, CR4: 0x00000000001606e0
RAX: 0x657661532f797251, RBX: 0xffffff821a7a2a38, RCX: 0xffffff823171b630, RDX: 0x657661532f797251
RSP: 0xffffff8288b83da0, RBP: 0xffffff8288b83dd0, RSI: 0xffffff83251e4008, RDI: 0xffffff823171b730
R8:  0x000000000000003f, R9:  0x0000000000000000, R10: 0xffffff8288b83cc0, R11: 0xffffff802b089ce8
R12: 0xffffff821a7a2a78, R13: 0x00000000000250d3, R14: 0xffffff821a7a2a58, R15: 0xffffff821ee0c1a8
RFL: 0x0000000000010202, RIP: 0xffffff7fab127dd4, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x00007fff19617200, Error code: 0x0000000000000000, Fault CPU: 0x1

Backtrace (CPU 1), Frame : Return Address
0xffffff820b02ddf0 : 0xffffff802aa22f79 mach_kernel : _panic + 0xc9
0xffffff820b02de70 : 0xffffff802aadc24e mach_kernel : _kernel_trap + 0x7fe
0xffffff820b02e040 : 0xffffff802aaf3746 mach_kernel : _return_from_trap + 0xe6
0xffffff820b02e060 : 0xffffff7fab127dd4 net.lundman.zfs : _zio_walk_parents + 0x94
0xffffff8288b83dd0 : 0xffffff7fab12ea53 net.lundman.zfs : _zio_done + 0x1073
0xffffff8288b83ef0 : 0xffffff7fab129f6a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff8288b83f30 : 0xffffff7fab129e35 net.lundman.zfs : _zio_execute + 0x15
0xffffff8288b83f50 : 0xffffff7fab01e1c7 net.lundman.spl : _taskq_thread + 0xc7
0xffffff8288b83fb0 : 0xffffff802aad7417 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[1185F90C-FE6B-34EB-B385-C360EB991B37]@0xffffff7fab01a000->0xffffff7fab02afff
         net.lundman.zfs(1.0)[0CE90CF6-C355-3255-A550-3E2FA41D5811]@0xffffff7fab02b000->0xffffff7fab238fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7faafec000
            dependency: net.lundman.spl(1.0.0)[1185F90C-FE6B-34EB-B385-C360EB991B37]@0xffffff7fab01a000

BSD process name corresponding to current thread: kernel_task
Boot args: -v keepsyms=y darkwake=0

@rottegift
Contributor Author

Another slightly different panic, again below zio somewhere, again during a scrub of ssdpool (since that is both fast and is implicated in panics).

(zfs @ issue197 + 95374b6 ; spl @ issue201 openzfsonosx/spl@9124b49 ; updating to the latest issue201 in moments)

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jul  8 10:53:28 2014
panic(cpu 2 caller 0xffffff80080dc24e): Kernel trap at 0xffffff7f887385e2, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0xffffff051cfbba00, CR3: 0x000000000a60b000, CR4: 0x00000000001606e0
RAX: 0x0000000000000000, RBX: 0xffffff82d49de000, RCX: 0xffffff81f34c2a18, RDX: 0xffffff051cfbba00
RSP: 0xffffff81f34c29b0, RBP: 0xffffff81f34c2a00, RSI: 0xffffff82d4a49b79, RDI: 0xffffff82d49de628
R8:  0xffffff82485bc158, R9:  0x0000000000000200, R10: 0x00000000003e0000, R11: 0xffffff82485bbec8
R12: 0xffffff802c2b8b20, R13: 0xffffff802dc668a0, R14: 0xffffff82d4a07cb8, R15: 0xffffff82d49fe678
RFL: 0x0000000000010282, RIP: 0xffffff7f887385e2, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0xffffff051cfbba00, Error code: 0x0000000000000002, Fault CPU: 0x2

Backtrace (CPU 2), Frame : Return Address
0xffffff81f34c2640 : 0xffffff8008022f79 mach_kernel : _panic + 0xc9
0xffffff81f34c26c0 : 0xffffff80080dc24e mach_kernel : _kernel_trap + 0x7fe
0xffffff81f34c2890 : 0xffffff80080f3746 mach_kernel : _return_from_trap + 0xe6
0xffffff81f34c28b0 : 0xffffff7f887385e2 net.lundman.zfs : _avl_insert + 0x62
0xffffff81f34c2a00 : 0xffffff7f88738dc4 net.lundman.zfs : _avl_add + 0x54
0xffffff81f34c2a30 : 0xffffff7f886c435e net.lundman.zfs : _vdev_queue_io_add + 0x3e
0xffffff81f34c2a50 : 0xffffff7f886c42a4 net.lundman.zfs : _vdev_queue_io + 0x114
0xffffff81f34c2a80 : 0xffffff7f8872d2aa net.lundman.zfs : _zio_vdev_io_start + 0x37a
0xffffff81f34c2af0 : 0xffffff7f88729e7a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff81f34c2b30 : 0xffffff7f887293bb net.lundman.zfs : _zio_nowait + 0x5b
0xffffff81f34c2b50 : 0xffffff7f886c2a7e net.lundman.zfs : _vdev_mirror_io_start + 0x14e
0xffffff81f34c2c10 : 0xffffff7f8872cfac net.lundman.zfs : _zio_vdev_io_start + 0x7c
0xffffff81f34c2c80 : 0xffffff7f88729e7a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff81f34c2cc0 : 0xffffff7f887293bb net.lundman.zfs : _zio_nowait + 0x5b
0xffffff81f34c2ce0 : 0xffffff7f88681e35 net.lundman.zfs : _dsl_scan_scrub_cb + 0x475
0xffffff81f34c2d90 : 0xffffff7f8868118f net.lundman.zfs : _dsl_scan_visitbp + 0x98f
0xffffff81f34c2f10 : 0xffffff7f886817fe net.lundman.zfs : _dsl_scan_visitdnode + 0xfe
0xffffff81f34c2fd0 : 0xffffff7f88680eb0 net.lundman.zfs : _dsl_scan_visitbp + 0x6b0
0xffffff81f34c3150 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c32d0 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c3450 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c35d0 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c3750 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c38d0 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c3a50 : 0xffffff7f886817fe net.lundman.zfs : _dsl_scan_visitdnode + 0xfe
0xffffff81f34c3b10 : 0xffffff7f88680fe6 net.lundman.zfs : _dsl_scan_visitbp + 0x7e6
0xffffff81f34c3c90 : 0xffffff7f8867fc34 net.lundman.zfs : _dsl_scan_visit_rootbp + 0xa4
0xffffff81f34c3d00 : 0xffffff7f8867ff05 net.lundman.zfs : _dsl_scan_visitds + 0x125
0xffffff81f34c3d70 : 0xffffff7f8867f514 net.lundman.zfs : _dsl_scan_visit + 0x1f4
0xffffff81f34c3dd0 : 0xffffff7f8867e942 net.lundman.zfs : _dsl_scan_sync + 0x5d2
0xffffff81f34c3e30 : 0xffffff7f886a3952 net.lundman.zfs : _spa_sync + 0x4b2
0xffffff81f34c3ee0 : 0xffffff7f886b3f76 net.lundman.zfs : _txg_sync_thread + 0x3e6
0xffffff81f34c3fb0 : 0xffffff80080d7417 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[557243BF-0246-3E41-B4B2-4E917FA522BB]@0xffffff7f8862b000->0xffffff7f88838fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f885ec000
            dependency: net.lundman.spl(1.0.0)[872B88E3-C60B-3AD3-920D-9CD93E7AD5E3]@0xffffff7f8861a000

@lundman
Contributor

lundman commented Jul 8, 2014

We have something big we are tracking down at the moment; we'll probably have something for you to try tomorrow.

@rottegift
Contributor Author

openzfsonosx/spl@ffdef74 panics every time in the sequence:

  1. zpool scrub ssdpool; rm /etc/zfs/zpool.cache*; reboot with that spl
  2. zpool import -N ssdpool
  3. zfs mount -a

(it also fails with zpool import ssdpool)

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jul  8 10:53:28 2014
panic(cpu 2 caller 0xffffff80080dc24e): Kernel trap at 0xffffff7f887385e2, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0xffffff051cfbba00, CR3: 0x000000000a60b000, CR4: 0x00000000001606e0
RAX: 0x0000000000000000, RBX: 0xffffff82d49de000, RCX: 0xffffff81f34c2a18, RDX: 0xffffff051cfbba00
RSP: 0xffffff81f34c29b0, RBP: 0xffffff81f34c2a00, RSI: 0xffffff82d4a49b79, RDI: 0xffffff82d49de628
R8:  0xffffff82485bc158, R9:  0x0000000000000200, R10: 0x00000000003e0000, R11: 0xffffff82485bbec8
R12: 0xffffff802c2b8b20, R13: 0xffffff802dc668a0, R14: 0xffffff82d4a07cb8, R15: 0xffffff82d49fe678
RFL: 0x0000000000010282, RIP: 0xffffff7f887385e2, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0xffffff051cfbba00, Error code: 0x0000000000000002, Fault CPU: 0x2

Backtrace (CPU 2), Frame : Return Address
0xffffff81f34c2640 : 0xffffff8008022f79 mach_kernel : _panic + 0xc9
0xffffff81f34c26c0 : 0xffffff80080dc24e mach_kernel : _kernel_trap + 0x7fe
0xffffff81f34c2890 : 0xffffff80080f3746 mach_kernel : _return_from_trap + 0xe6
0xffffff81f34c28b0 : 0xffffff7f887385e2 net.lundman.zfs : _avl_insert + 0x62
0xffffff81f34c2a00 : 0xffffff7f88738dc4 net.lundman.zfs : _avl_add + 0x54
0xffffff81f34c2a30 : 0xffffff7f886c435e net.lundman.zfs : _vdev_queue_io_add + 0x3e
0xffffff81f34c2a50 : 0xffffff7f886c42a4 net.lundman.zfs : _vdev_queue_io + 0x114
0xffffff81f34c2a80 : 0xffffff7f8872d2aa net.lundman.zfs : _zio_vdev_io_start + 0x37a
0xffffff81f34c2af0 : 0xffffff7f88729e7a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff81f34c2b30 : 0xffffff7f887293bb net.lundman.zfs : _zio_nowait + 0x5b
0xffffff81f34c2b50 : 0xffffff7f886c2a7e net.lundman.zfs : _vdev_mirror_io_start + 0x14e
0xffffff81f34c2c10 : 0xffffff7f8872cfac net.lundman.zfs : _zio_vdev_io_start + 0x7c
0xffffff81f34c2c80 : 0xffffff7f88729e7a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff81f34c2cc0 : 0xffffff7f887293bb net.lundman.zfs : _zio_nowait + 0x5b
0xffffff81f34c2ce0 : 0xffffff7f88681e35 net.lundman.zfs : _dsl_scan_scrub_cb + 0x475
0xffffff81f34c2d90 : 0xffffff7f8868118f net.lundman.zfs : _dsl_scan_visitbp + 0x98f
0xffffff81f34c2f10 : 0xffffff7f886817fe net.lundman.zfs : _dsl_scan_visitdnode + 0xfe
0xffffff81f34c2fd0 : 0xffffff7f88680eb0 net.lundman.zfs : _dsl_scan_visitbp + 0x6b0
0xffffff81f34c3150 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c32d0 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c3450 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c35d0 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c3750 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c38d0 : 0xffffff7f88680b96 net.lundman.zfs : _dsl_scan_visitbp + 0x396
0xffffff81f34c3a50 : 0xffffff7f886817fe net.lundman.zfs : _dsl_scan_visitdnode + 0xfe
0xffffff81f34c3b10 : 0xffffff7f88680fe6 net.lundman.zfs : _dsl_scan_visitbp + 0x7e6
0xffffff81f34c3c90 : 0xffffff7f8867fc34 net.lundman.zfs : _dsl_scan_visit_rootbp + 0xa4
0xffffff81f34c3d00 : 0xffffff7f8867ff05 net.lundman.zfs : _dsl_scan_visitds + 0x125
0xffffff81f34c3d70 : 0xffffff7f8867f514 net.lundman.zfs : _dsl_scan_visit + 0x1f4
0xffffff81f34c3dd0 : 0xffffff7f8867e942 net.lundman.zfs : _dsl_scan_sync + 0x5d2
0xffffff81f34c3e30 : 0xffffff7f886a3952 net.lundman.zfs : _spa_sync + 0x4b2
0xffffff81f34c3ee0 : 0xffffff7f886b3f76 net.lundman.zfs : _txg_sync_thread + 0x3e6
0xffffff81f34c3fb0 : 0xffffff80080d7417 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[557243BF-0246-3E41-B4B2-4E917FA522BB]@0xffffff7f8862b000->0xffffff7f88838fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f885ec000
            dependency: net.lundman.spl(1.0.0)[872B88E3-C60B-3AD3-920D-9CD93E7AD5E3]@0xffffff7f8861a000

BSD process name corresponding to current thread: kernel_task
Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jul  8 11:23:02 2014
panic(cpu 6 caller 0xffffff7f8d14139a): "returning same addy twice: 0xffffff81fc8ffc40\n"@spl-bmalloc.c:854
Backtrace (CPU 6), Frame : Return Address
0xffffff81f6df3240 : 0xffffff800b422f79 mach_kernel : _panic + 0xc9
0xffffff81f6df32c0 : 0xffffff7f8d14139a net.lundman.spl : _slice_allocator_alloc + 0x17a
0xffffff81f6df3300 : 0xffffff7f8d1418b1 net.lundman.spl : _bmalloc + 0x21
0xffffff81f6df3320 : 0xffffff7f8d13cbb6 net.lundman.spl : _zfs_kmem_alloc + 0x16
0xffffff81f6df3350 : 0xffffff7f8d150b38 net.lundman.zfs : _arc_read + 0x858
0xffffff81f6df34a0 : 0xffffff7f8d15e57b net.lundman.zfs : _dbuf_prefetch + 0x59b
0xffffff81f6df35e0 : 0xffffff7f8d17f7a8 net.lundman.zfs : _dmu_zfetch_fetch + 0x58
0xffffff81f6df3620 : 0xffffff7f8d17f6f0 net.lundman.zfs : _dmu_zfetch_dofetch + 0x1d0
0xffffff81f6df3690 : 0xffffff7f8d17ee2d net.lundman.zfs : _dmu_zfetch_find + 0x80d
0xffffff81f6df3730 : 0xffffff7f8d17e2af net.lundman.zfs : _dmu_zfetch + 0x11f
0xffffff81f6df3820 : 0xffffff7f8d15b99c net.lundman.zfs : _dbuf_read + 0x1bc
0xffffff81f6df3880 : 0xffffff7f8d181ed6 net.lundman.zfs : _dnode_hold_impl + 0x216
0xffffff81f6df3940 : 0xffffff7f8d182434 net.lundman.zfs : _dnode_hold + 0x34
0xffffff81f6df3970 : 0xffffff7f8d167036 net.lundman.zfs : _dmu_bonus_hold + 0x36
0xffffff81f6df39c0 : 0xffffff7f8d18e4e4 net.lundman.zfs : _dsl_deadlist_open + 0x74
0xffffff81f6df3a20 : 0xffffff7f8d186ced net.lundman.zfs : _dsl_dataset_hold_obj + 0x1bd
0xffffff81f6df3ad0 : 0xffffff7f8d187342 net.lundman.zfs : _dsl_dataset_hold + 0x92
0xffffff81f6df3b40 : 0xffffff7f8d16d178 net.lundman.zfs : _dmu_objset_hold + 0x58
0xffffff81f6df3b90 : 0xffffff7f8d211640 net.lundman.zfs : _zfs_ioc_objset_stats + 0x20
0xffffff81f6df3bc0 : 0xffffff7f8d21199e net.lundman.zfs : _zfs_ioc_dataset_list_next + 0x19e
0xffffff81f6df3c20 : 0xffffff7f8d20d6b4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff81f6df3cf0 : 0xffffff800b60d92f mach_kernel : _spec_ioctl + 0x11f
0xffffff81f6df3d40 : 0xffffff800b5fe2f0 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff81f6df3dc0 : 0xffffff800b5f4141 mach_kernel : _utf8_normalizestr + 0x971
0xffffff81f6df3e10 : 0xffffff800b7c16a3 mach_kernel : _fo_ioctl + 0x43
0xffffff81f6df3e40 : 0xffffff800b7f3026 mach_kernel : _ioctl + 0x466
0xffffff81f6df3f50 : 0xffffff800b840a33 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff81f6df3fb0 : 0xffffff800b4f3f46 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[43EDA1D3-DBA9-3A07-8F9A-ED4F1052E714]@0xffffff7f8d13b000->0xffffff7f8d14bfff
         net.lundman.zfs(1.0)[9C932B4D-5EDB-3D80-81C6-EF228011F6D8]@0xffffff7f8d14c000->0xffffff7f8d358fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f8b9f3000
            dependency: net.lundman.spl(1.0.0)[43EDA1D3-DBA9-3A07-8F9A-ED4F1052E714]@0xffffff7f8d13b000

BSD process name corresponding to current thread: zfs
Boot args: -v keepsyms=y darkwake=0

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jul  8 11:32:06 2014
panic(cpu 2 caller 0xffffff7fa834139a): "returning same addy twice: 0xffffff8215c49d48\n"@spl-bmalloc.c:854
Backtrace (CPU 2), Frame : Return Address
0xffffff82115c37f0 : 0xffffff8026622f79 mach_kernel : _panic + 0xc9
0xffffff82115c3870 : 0xffffff7fa834139a net.lundman.spl : _slice_allocator_alloc + 0x17a
0xffffff82115c38b0 : 0xffffff7fa83418b1 net.lundman.spl : _bmalloc + 0x21
0xffffff82115c38d0 : 0xffffff7fa833cbb6 net.lundman.spl : _zfs_kmem_alloc + 0x16
0xffffff82115c3900 : 0xffffff7fa8463530 net.lundman.zfs : _nv_alloc_sleep_spl + 0x20
0xffffff82115c3920 : 0xffffff7fa845c6fc net.lundman.zfs : _nv_mem_zalloc + 0x3c
0xffffff82115c3960 : 0xffffff7fa845c593 net.lundman.zfs : _nvlist_xalloc + 0x73
0xffffff82115c39a0 : 0xffffff7fa845c519 net.lundman.zfs : _nvlist_alloc + 0x39
0xffffff82115c39d0 : 0xffffff7fa839d5b1 net.lundman.zfs : _dsl_prop_get_all_ds + 0x61
0xffffff82115c3b30 : 0xffffff7fa839d545 net.lundman.zfs : _dsl_prop_get_all + 0x25
0xffffff82115c3b50 : 0xffffff7fa8414d5d net.lundman.zfs : _zfs_ioc_objset_stats_impl + 0x4d
0xffffff82115c3b90 : 0xffffff7fa841165d net.lundman.zfs : _zfs_ioc_objset_stats + 0x3d
0xffffff82115c3bc0 : 0xffffff7fa841199e net.lundman.zfs : _zfs_ioc_dataset_list_next + 0x19e
0xffffff82115c3c20 : 0xffffff7fa840d6b4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff82115c3cf0 : 0xffffff802680d92f mach_kernel : _spec_ioctl + 0x11f
0xffffff82115c3d40 : 0xffffff80267fe2f0 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff82115c3dc0 : 0xffffff80267f4141 mach_kernel : _utf8_normalizestr + 0x971
0xffffff82115c3e10 : 0xffffff80269c16a3 mach_kernel : _fo_ioctl + 0x43
0xffffff82115c3e40 : 0xffffff80269f3026 mach_kernel : _ioctl + 0x466
0xffffff82115c3f50 : 0xffffff8026a40a33 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff82115c3fb0 : 0xffffff80266f3f46 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[43EDA1D3-DBA9-3A07-8F9A-ED4F1052E714]@0xffffff7fa833b000->0xffffff7fa834bfff
         net.lundman.zfs(1.0)[9C932B4D-5EDB-3D80-81C6-EF228011F6D8]@0xffffff7fa834c000->0xffffff7fa8558fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fa6bf3000
            dependency: net.lundman.spl(1.0.0)[43EDA1D3-DBA9-3A07-8F9A-ED4F1052E714]@0xffffff7fa833b000

BSD process name corresponding to current thread: zfs
Boot args: -v keepsyms=y darkwake=0

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jul  8 11:34:09 2014
panic(cpu 2 caller 0xffffff7f9014139a): "returning same addy twice: 0xffffff81fc211798\n"@spl-bmalloc.c:854
Backtrace (CPU 2), Frame : Return Address
0xffffff821c47b820 : 0xffffff800e422f79 mach_kernel : _panic + 0xc9
0xffffff821c47b8a0 : 0xffffff7f9014139a net.lundman.spl : _slice_allocator_alloc + 0x17a
0xffffff821c47b8e0 : 0xffffff7f901418b1 net.lundman.spl : _bmalloc + 0x21
0xffffff821c47b900 : 0xffffff7f9013cbb6 net.lundman.spl : _zfs_kmem_alloc + 0x16
0xffffff821c47b930 : 0xffffff7f90263530 net.lundman.zfs : _nv_alloc_sleep_spl + 0x20
0xffffff821c47b950 : 0xffffff7f9025c6fc net.lundman.zfs : _nv_mem_zalloc + 0x3c
0xffffff821c47b990 : 0xffffff7f9025c593 net.lundman.zfs : _nvlist_xalloc + 0x73
0xffffff821c47b9d0 : 0xffffff7f9025c519 net.lundman.zfs : _nvlist_alloc + 0x39
0xffffff821c47ba00 : 0xffffff7f901ba345 net.lundman.zfs : _spa_prop_get + 0x45
0xffffff821c47bbf0 : 0xffffff7f90210b29 net.lundman.zfs : _zfs_ioc_pool_get_props + 0x89
0xffffff821c47bc20 : 0xffffff7f9020d6b4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff821c47bcf0 : 0xffffff800e60d92f mach_kernel : _spec_ioctl + 0x11f
0xffffff821c47bd40 : 0xffffff800e5fe2f0 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff821c47bdc0 : 0xffffff800e5f4141 mach_kernel : _utf8_normalizestr + 0x971
0xffffff821c47be10 : 0xffffff800e7c16a3 mach_kernel : _fo_ioctl + 0x43
0xffffff821c47be40 : 0xffffff800e7f3026 mach_kernel : _ioctl + 0x466
0xffffff821c47bf50 : 0xffffff800e840a33 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff821c47bfb0 : 0xffffff800e4f3f46 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[43EDA1D3-DBA9-3A07-8F9A-ED4F1052E714]@0xffffff7f9013b000->0xffffff7f9014bfff
         net.lundman.zfs(1.0)[9C932B4D-5EDB-3D80-81C6-EF228011F6D8]@0xffffff7f9014c000->0xffffff7f90358fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f8e9f3000
            dependency: net.lundman.spl(1.0.0)[43EDA1D3-DBA9-3A07-8F9A-ED4F1052E714]@0xffffff7f9013b000

BSD process name corresponding to current thread: zpool

@rottegift
Contributor Author

I've changed the returning same addy twice panic to a printf; it does not happen enormously often so far, mostly during dataset mounts.

(edit: "zfs list" on its own triggers one "returning same addy twice" per dataset)

@lundman
Contributor

lundman commented Jul 8, 2014

It is the zfs send panics: we are returning memory, again, that is not free, so anything can happen. Tracking it down now.

@rottegift
Contributor Author

@ilovezfs I think the panics are now taken care of with the spl changes. Should we close this and migrate the zfs.fs / importing / zpool-status-shows-wrong-devices stuff to a new issue?

@ilovezfs
Contributor

@rottegift perhaps we should use good-old #167 ?

@rottegift
Contributor Author

@ilovezfs Sure. It's easy and not too high impact to arbitrarily physically detach and re-attach the cache vdevs, which are probably the worst case for renumbering and recovering from it; removing one of the three-way-mirror disks is also not too disruptive and is also likely to lead to renumbering.

That also seems to fit with taking several minutes at system startup to probe and reprobe all the disks while rarely actually importing non-trivial pools and sometimes getting the device names wrong on manual import.

@ilovezfs
Contributor

@rottegift
Contributor Author

Sure, I'll do so a bit later today. (I'm mainly waiting on a large zfs send/recv).

@rottegift
Contributor Author

I got a double panic during the export phase when shutting down to reboot with the delayprobe branch after about 1d14h uptime without problem. Sadly the panic report is entirely useless.

@lundman
Contributor

lundman commented Jul 10, 2014

During export? Probably unrelated/new to what we were seeing before, at least.

@rottegift
Contributor Author

@ilovezfs - delayprobe branch imported everything correctly -- all devices ONLINE, all log vdevs accumulating data per zpool iostat -v. Nice.

All pools imported (it would be nice to have a way to exclude pools from automatic import somehow) and in arbitrary serial order. Three minutes from nothing to all-imported-OK though is pretty good.

@rottegift
Contributor Author

@lundman : well, hopefully it'll recur. Late during shutdown is the best time for a panic, although double panics are always gonna be guesswork.

@ilovezfs
Contributor

@rottegift delayprobe isn't doing anything about renumbering after we're already imported. But what it should do, in addition to the import -a during boot, is import a pool if you connect it later. As for limiting what pools get imported, if you change the command in zpool-import-all.sh from zpool import -a to zpool import -c /etc/zfs/zpool.cache -a that may give you what you're looking for. And you can limit what would be imported later if you connect the pool later, by #define'ing ZFS_AUTOIMPORT_ZPOOL_CACHE_ONLY in zfs_util.c.

@ilovezfs
Contributor

@rottegift I made the changes I suggested, so that you can try that out:
8e0c3c6

@rottegift
Contributor Author

Ok the commit is pretty straightforward (I'm glad you changed the #ifndef DEBUG logic, defining DEBUG by hand in zfs/cmd/zfs_util/zfs_util.c is easy). I'll try it now and hope It Just Works because I won't be able to spend much time with it for the next 14 hours or so.

@rottegift
Contributor Author

@lundman : one normal shutdown since then, and now one that is stuck waiting, with blinking lights showing activity on one pool and "ZFS: Reclaim thread is being slow (20001)" logged to the console shortly after verbose shutdown printed "continuing" (which was more than fifteen minutes ago).

Not a panic, but it's an awfully long time to shut down cleanly.

@rottegift
Contributor Author

Got bored, forced reboot.

@ilovezfs : One of the four pools in the cachefile imported in a DEGRADED state; the rest did not import at all.

$ zpool status -v
  pool: Donkey
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
    the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://zfsonlinux.org/msg/ZFS-8000-2Q
  scan: resilvered 165M in 0h48m with 0 errors on Sat Jul  5 15:37:25 2014
config:

    NAME                      STATE     READ WRITE CKSUM
    Donkey                    DEGRADED     0     0     0
      mirror-0                DEGRADED     0     0     0
        disk19s2              ONLINE       0     0     0
        disk20s2              ONLINE       0     0     0
        4776926607176438668   FAULTED      0     0     0  was /dev/disk23s2
    logs
      mirror-1                DEGRADED     0     0     0
        18176774232874365893  UNAVAIL      0     0     0  was /dev/disk16s1
        disk14s1              ONLINE       0     0     0
    cache
      disk38s1                UNAVAIL      0     0     0  cannot open

errors: No known data errors

Debug log entries for zfs.util are at:

https://gist.github.com/rottegift/50655293c0da077ce765

@rottegift
Contributor Author

With ssdpool imported and up OK and Donkey still in the DEGRADED state above, I did a shutdown -r now.

When the system came back up, nothing got imported at all, and these are the only zfs.util log entries.

11/07/2014 11:31:22.307 zfs.util[624]: argv[0]: /System/Library/Filesystems/zfs.fs/Contents/Resources/./zfs.util
11/07/2014 11:31:22.308 zfs.util[624]: argv[1]: -p
11/07/2014 11:31:22.308 zfs.util[624]: argv[2]: disk34s2
11/07/2014 11:31:22.308 zfs.util[624]: argv[3]: removable
11/07/2014 11:31:22.308 zfs.util[624]: argv[4]: readonly
11/07/2014 11:31:22.308 zfs.util[624]: zfs.util called with option p
11/07/2014 11:31:22.308 zfs.util[624]: blockdevice is /dev/disk34s2
11/07/2014 11:31:22.320 zfs.util[628]: argv[0]: /System/Library/Filesystems/zfs.fs/Contents/Resources/./zfs.util
11/07/2014 11:31:22.320 zfs.util[628]: argv[1]: -k
11/07/2014 11:31:22.321 zfs.util[628]: argv[2]: disk34s2
11/07/2014 11:31:22.321 zfs.util[628]: zfs.util called with option k
11/07/2014 11:31:22.321 zfs.util[628]: blockdevice is /dev/disk34s2
11/07/2014 11:31:22.321 zfs.util[628]: FSUC_GETUUID
 diskutil list disk34
/dev/disk34
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *3.0 TB     disk34
   1:                        EFI EFI                     209.7 MB   disk34s1
   2:                        ZFS                         3.0 TB     disk34s2

Subsequent "zpool import -N {poolname}" is doing the right thing.

@rottegift
Contributor Author

Oops, correction. There were more entries in syslog that weren't being shown by Console app. I should know better, sorry.

https://gist.github.com/rottegift/6b7919ab3cf168a849d0

@ilovezfs
Contributor

@rottegift Yeah, this is not mysterious. It turns out that zpool import -c /etc/zfs/zpool.cache -a only does a verbatim import, so it looks like zpool import -a will be the way to go for now, with no means of specifying pools that should not be imported. Perhaps at a later date we can consider such an enhancement if there's significant interest. So the remaining decision is whether, subsequent to the initial zpool import -a during boot, to permit zfs.util to probe and import any pool that gets connected later, or only pools named in the cache file.

As a side note, it looks like there may be a bug with zpool import -c when the disk numbers don't match the verbatim disk numbers in the cache file. With it expecting disk5, but the device actually at disk6, check this out:

joe-vm:~ joe$ sudo zpool import -c /etc/zfs/zpool.cache -a
internal error: Value too large to be stored in data type

It's possible it's not a bug per se, in that it's trying to find the pool on disk5 (all zeros), but it would be nice to at least have a better error. I suspect this is an upstream issue.

@ilovezfs
Contributor

@rottegift Ah, so that bug is caused by the disk5 placeholder being too small (I made it 32 M). Using a larger placeholder for disk5, I get

joe-vm:~ joe$ sudo zpool import -c /etc/zfs/zpool.cache -a
cannot import 'foo': one or more devices is currently unavailable

as you'd expect.

@rottegift
Contributor Author

@ilovezfs - I can live with "try to import everything attached at boot time" for now (and until O3X catches up with the latest illumos feature flags, which likely will come from ZoL in some weeks or months).

However, I would really like an easy way to not import a pool whose devices are attached later. The particular use case in mind is attaching physical disks that are meant to be plumbed up to a virtual machine, but which are importable-in-principle by O3X on the native machine. (I figure that you guys might like that too if and when you test with vdevs on physical devices rather than files and zvols.)

@ilovezfs
Contributor

@rottegift Well, limiting the auto-import-laters to only pools listed in cache file should take care of that, no?

@rottegift
Contributor Author

@ilovezfs Yes, I think that would be fine.

@ilovezfs
Contributor

@rottegift The changes we discussed are merged to master now.

@rottegift
Contributor Author

I don't think this needs to be open. If any of the various things above become live problems again, I'll make a new issue.
