SPL PANIC when deleting a snapshot #1499

Closed
atonkyra opened this issue Jun 5, 2013 · 13 comments

Comments

atonkyra commented Jun 5, 2013

Encountered an SPL panic when a snapshot was deleted. The platform is Ubuntu 12.04 and the zfsonlinux version is roughly 4-5 months old; I can add details later. dmesg:

[8419543.548971] SPLError: 4802:0:(space_map.c:109:space_map_add()) SPL PANIC
[8419543.548972] SPL: Showing stack for process 4802
[8419543.548974] Pid: 4802, comm: z_fr_iss/8 Tainted: G W 3.7.9-fsolstorage+2 #4
[8419543.548974] Call Trace:
[8419543.548980] [] ? spl_debug_dumpstack+0x1d/0x40
[8419543.548982] [] ? spl_debug_bug+0x73/0xd0
[8419543.548987] [] ? space_map_add+0xee/0x3b0
[8419543.548991] [] ? __mutex_lock_slowpath+0x56/0x150
[8419543.548993] [] ? __mutex_lock_slowpath+0x56/0x150
[8419543.548997] [] ? metaslab_free_dva+0x125/0x200
[8419543.548999] [] ? metaslab_free+0x84/0xb0
[8419543.549002] [] ? zio_dva_free+0x17/0x30
[8419543.549004] [] ? zio_execute+0x95/0x100
[8419543.549006] [] ? taskq_thread+0x216/0x4c0
[8419543.549009] [] ? try_to_wake_up+0x2a0/0x2a0
[8419543.549013] [] ? task_expire+0x110/0x110
[8419543.549015] [] ? task_expire+0x110/0x110
[8419543.549018] [] ? kthread+0xce/0xe0
[8419543.549020] [] ? kthread_parkme+0x30/0x30
[8419543.549023] [] ? ret_from_fork+0x7c/0xb0
[8419543.549025] [] ? kthread_parkme+0x30/0x30
[8419543.549100] SPLError: 4801:0:(space_map.c:95:space_map_add()) SPL PANIC
[8419543.549143] SPL: Showing stack for process 4801
[8419543.549145] Pid: 4801, comm: z_fr_iss/7 Tainted: G W 3.7.9-fsolstorage+2 #4
[8419543.549147] Call Trace:
[8419543.549151] [] ? spl_debug_dumpstack+0x1d/0x40
[8419543.549154] [] ? spl_debug_bug+0x73/0xd0
[8419543.549157] [] ? space_map_add+0x2ce/0x3b0
[8419543.549160] [] ? kmalloc_nofail+0x28/0xc0
[8419543.549163] [] ? __mutex_lock_slowpath+0x56/0x150
[8419543.549165] [] ? __mutex_lock_slowpath+0x56/0x150
[8419543.549168] [] ? metaslab_free_dva+0x125/0x200
[8419543.549170] [] ? metaslab_free+0x84/0xb0
[8419543.549173] [] ? zio_dva_free+0x17/0x30
[8419543.549175] [] ? zio_execute+0x95/0x100
[8419543.549177] [] ? taskq_thread+0x216/0x4c0
[8419543.549181] [] ? try_to_wake_up+0x2a0/0x2a0
[8419543.549183] [] ? task_expire+0x110/0x110
[8419543.549185] [] ? task_expire+0x110/0x110
[8419543.549188] [] ? kthread+0xce/0xe0
[8419543.549190] [] ? kthread_parkme+0x30/0x30
[8419543.549193] [] ? ret_from_fork+0x7c/0xb0
[8419543.549195] [] ? kthread_parkme+0x30/0x30

atonkyra commented Jun 5, 2013

The pool is now unimportable due to the same kind of panic:

[8419543.548962] VERIFY(P2PHASE(size, 1ULL << sm->sm_shift) == 0) failed
[8419543.548967] VERIFY(ss == NULL) failed
[8419543.548971] SPLError: 4802:0:(space_map.c:109:space_map_add()) SPL PANIC

atonkyra commented Jun 5, 2013

The zdb -lu /some/disk output requested by ryao is at
http://d.adm.fi/zol-splerror

Also some additional screencaps:
http://d.adm.fi/ro.png
http://d.adm.fi/rw.png
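(For context, that command dumps the vdev labels and uberblock arrays of a single pool member device; a minimal sketch, with the device path below as a placeholder rather than our actual disk:)

# dump all four vdev labels (-l) and the uberblock arrays (-u) of one pool member device
zdb -lu /dev/disk/by-id/ata-EXAMPLE-part1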

@mailinglists35

ZoL development moves quickly; code that is 4-5 months old might be too old. Give the developers a chance to help you by first updating to the latest -daily code!
Also, some code is pulled from illumos, the main ZFS fork since Oracle closed the source. To check whether the bug exists upstream, try accessing the pool from a recent illumos-based distribution such as SmartOS; the full list is here: http://wiki.illumos.org/display/illumos/Distributions
You'll have to figure out which one is most up to date, but for example DilOS is dated Apr 7, 2013. I'm sure there are others that are even more recent.

ryao commented Jun 5, 2013

I had a chat with @duidalus and one of his colleagues about this issue last night. They were able to do an import using an OmniOS LiveCD. They used either OmniOS_Text_r151006c.iso or OmniOS_Text_bloody_20130208.iso; I am not sure which (clarification by @duidalus would be helpful). They did a read-only import without issue on OmniOS. I am not sure whether that is because OmniOS was built without assertions or because Illumos has made improvements to its import code. The last we spoke, they were preparing to copy their data off the pool.

From our chat last night, I am under the impression that the system crashed (or hung and was rebooted) while snapshots were being deleted in rapid succession. If the deletion was concurrent, then this might have been related to issue #1495. @duidalus would need to clarify these things. My guess is that the barrier regression fixed by d9b0ebb caused pool corruption following a crash or hard reboot.

atonkyra commented Jun 5, 2013

Basically, the flow of events started with deleting snapshots in rapid succession, and then the SPL panic hit. I am moderately sure that our ZFS version was recent enough to include d9b0ebb, but I cannot confirm that just yet.

We were able to mount the pool read-only in OmniOS (a read-write import caused the same kind of assertion as the one we hit prior to the SPL panic).
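(For reference, the read-only import amounted to something like the following; a sketch only, with the pool name as a placeholder:)

# import the pool read-only and without mounting datasets, so nothing new gets written
zpool import -o readonly=on -N tank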

@behlendorf

Depending on how old the code was, this might also be related to #541, which has been fixed. Were you able to import and mount the pool read-only on Linux as well, or just on OmniOS?

@atonkyra

Only on OmniOS; on Linux, SPL seemed to have an issue mounting the pool read-only. The ASSERT that triggered on the read-only import is in this screencap: http://d.adm.fi/ro.png

@atonkyra

Okay, after unpacking the zImage of the kernel that was in use, I am fairly certain the SPL version was v0.6.0-rc14, which may also have been the ZFS version (that tag predates the barrier fix).

In any case, we were able to recover our pool by moving the data into a newly created one. The only real issue with this operation was analyzing what happened and figuring out how to recover, which was made harder in part by the failed read-only import on Linux.
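(For anyone else in this situation, the copy-off essentially meant replicating an existing snapshot from the read-only imported pool into a fresh pool; a sketch only, with hypothetical pool, dataset, and snapshot names, and assuming a usable snapshot already exists since new ones cannot be taken on a read-only pool:)

# send a full replication stream of an existing snapshot into the new pool
zfs send -R oldpool/data@last-good | zfs recv -F newpool/data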

@behlendorf

@duidalus The assert triggered when importing the pool read-only on Linux was fixed four months ago; see issue #1332. The fix is in the 0.6.1 tag. Were you using an older version?

@atonkyra

Yeah, we were using a version prior to that.

I did, however, examine the sources for the kernel we used and found that they do include the d9b0ebb fix, so it is doubtful that it was the cause of the initial problem. I guess #1495 could be the source of the problems, but I have no real idea about that.
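(A hypothetical way to check that kind of thing, assuming a git checkout of the zfs repository rather than unpacked kernel sources:)

# exits 0 (and prints the message) only if commit d9b0ebb is an ancestor of the checked-out tree
git merge-base --is-ancestor d9b0ebb HEAD && echo "d9b0ebb is included"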

I'm not sure whether it is at all relevant, but the pool did have some corrupt data from its Solaris days, before we moved it onto zfsonlinux.

@behlendorf

@duidalus Depending on the corruption, it could be responsible for the space_map_add assert above. Hard to say for sure, though.

@FransUrbo

@behlendorf Considering we haven't heard from @atonkyra in a year, should we close this as stale?

@atonkyra

I'm OK with that, since the pool is long gone (and thus impossible to debug any further).

behlendorf modified the milestone: 0.6.5 (Nov 8, 2014)