SPL PANIC when deleting a snapshot #1499

Closed
atonkyra opened this issue Jun 5, 2013 · 13 comments

Comments

atonkyra commented Jun 5, 2013

Encountered an SPL panic when a snapshot was deleted. The platform is Ubuntu 12.04 and the zfsonlinux version is roughly 4-5 months old; I can add details later. dmesg:

[8419543.548971] SPLError: 4802:0:(space_map.c:109:space_map_add()) SPL PANIC
[8419543.548972] SPL: Showing stack for process 4802
[8419543.548974] Pid: 4802, comm: z_fr_iss/8 Tainted: G W 3.7.9-fsolstorage+2 #4
[8419543.548974] Call Trace:
[8419543.548980] [] ? spl_debug_dumpstack+0x1d/0x40
[8419543.548982] [] ? spl_debug_bug+0x73/0xd0
[8419543.548987] [] ? space_map_add+0xee/0x3b0
[8419543.548991] [] ? __mutex_lock_slowpath+0x56/0x150
[8419543.548993] [] ? __mutex_lock_slowpath+0x56/0x150
[8419543.548997] [] ? metaslab_free_dva+0x125/0x200
[8419543.548999] [] ? metaslab_free+0x84/0xb0
[8419543.549002] [] ? zio_dva_free+0x17/0x30
[8419543.549004] [] ? zio_execute+0x95/0x100
[8419543.549006] [] ? taskq_thread+0x216/0x4c0
[8419543.549009] [] ? try_to_wake_up+0x2a0/0x2a0
[8419543.549013] [] ? task_expire+0x110/0x110
[8419543.549015] [] ? task_expire+0x110/0x110
[8419543.549018] [] ? kthread+0xce/0xe0
[8419543.549020] [] ? kthread_parkme+0x30/0x30
[8419543.549023] [] ? ret_from_fork+0x7c/0xb0
[8419543.549025] [] ? kthread_parkme+0x30/0x30
[8419543.549100] SPLError: 4801:0:(space_map.c:95:space_map_add()) SPL PANIC
[8419543.549143] SPL: Showing stack for process 4801
[8419543.549145] Pid: 4801, comm: z_fr_iss/7 Tainted: G W 3.7.9-fsolstorage+2 #4
[8419543.549147] Call Trace:
[8419543.549151] [] ? spl_debug_dumpstack+0x1d/0x40
[8419543.549154] [] ? spl_debug_bug+0x73/0xd0
[8419543.549157] [] ? space_map_add+0x2ce/0x3b0
[8419543.549160] [] ? kmalloc_nofail+0x28/0xc0
[8419543.549163] [] ? __mutex_lock_slowpath+0x56/0x150
[8419543.549165] [] ? __mutex_lock_slowpath+0x56/0x150
[8419543.549168] [] ? metaslab_free_dva+0x125/0x200
[8419543.549170] [] ? metaslab_free+0x84/0xb0
[8419543.549173] [] ? zio_dva_free+0x17/0x30
[8419543.549175] [] ? zio_execute+0x95/0x100
[8419543.549177] [] ? taskq_thread+0x216/0x4c0
[8419543.549181] [] ? try_to_wake_up+0x2a0/0x2a0
[8419543.549183] [] ? task_expire+0x110/0x110
[8419543.549185] [] ? task_expire+0x110/0x110
[8419543.549188] [] ? kthread+0xce/0xe0
[8419543.549190] [] ? kthread_parkme+0x30/0x30
[8419543.549193] [] ? ret_from_fork+0x7c/0xb0
[8419543.549195] [] ? kthread_parkme+0x30/0x30

atonkyra commented Jun 5, 2013

The pool is now unimportable due to the same kind of panic:

[8419543.548962] VERIFY(P2PHASE(size, 1ULL << sm->sm_shift) == 0) failed
[8419543.548967] VERIFY(ss == NULL) failed
[8419543.548971] SPLError: 4802:0:(space_map.c:109:space_map_add()) SPL PANIC

atonkyra commented Jun 5, 2013

The zdb -lu /some/disk output requested by ryao is at
http://d.adm.fi/zol-splerror

Also some additional screencaps:
http://d.adm.fi/ro.png
http://d.adm.fi/rw.png
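(For context, that command dumps the vdev labels and uberblock arrays of a single pool member device; a minimal sketch, with the device path below as a placeholder rather than our actual disk:)

# dump all four vdev labels (-l) and the uberblock arrays (-u) of one pool member device
zdb -lu /dev/disk/by-id/ata-EXAMPLE-part1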

@mailinglists35

ZoL development moves quickly; code that is 4-5 months old might be too old. Give the developers a chance to help you by first updating to the latest -daily code!
Also, some code is pulled from illumos, the main ZFS fork since Oracle closed the source. To check whether the bug exists upstream, try accessing the pool from a recent illumos-based distribution such as SmartOS; the full list is here: http://wiki.illumos.org/display/illumos/Distributions
You'll have to figure out which one is most up to date, but for example DilOS is dated Apr 7, 2013. I'm sure there are others that are even more recent.

ryao commented Jun 5, 2013

I had a chat with @duidalus and one of his colleagues about this issue last night. They were able to do an import using an OmniOS LiveCD. They used either OmniOS_Text_r151006c.iso or OmniOS_Text_bloody_20130208.iso; I am not sure which (clarification by @duidalus would be helpful). They did a read-only import without issue on OmniOS. I am not sure whether that is because OmniOS was built without assertions or because Illumos has made improvements to its import code. The last we spoke, they were preparing to copy their data off the pool.

From our chat last night, I am under the impression that the system crashed (or hung and was rebooted) while snapshots were being deleted in rapid succession. If the deletion was concurrent, then this might have been related to issue #1495. @duidalus would need to clarify these things. My guess is that the barrier regression fixed by d9b0ebb caused pool corruption following a crash or hard reboot.

atonkyra commented Jun 5, 2013

Basically, the flow of events started with deleting snapshots in rapid succession, and then the SPL panic hit. I am moderately sure that our ZFS version was recent enough to include d9b0ebb, but I cannot confirm that just yet.

We were able to mount the pool read-only in OmniOS (a read-write import caused the same kind of assertion as the one we hit prior to the SPL panic).
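(For reference, the read-only import amounted to something like the following; a sketch only, with the pool name as a placeholder:)

# import the pool read-only and without mounting datasets, so nothing new gets written
zpool import -o readonly=on -N tank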

@behlendorf

Depending on how old the code was, this might also be related to #541, which has been fixed. Were you able to import and mount the pool read-only on Linux as well, or just on OmniOS?

@atonkyra

Only on OmniOS; on Linux, SPL seemed to have an issue mounting the pool read-only. The ASSERT that triggered on the read-only import is in this screencap: http://d.adm.fi/ro.png

@atonkyra

Okay, after unpacking the zImage of the kernel that was in use, I am fairly certain the SPL version was v0.6.0-rc14, which may also have been the ZFS version (that tag predates the barrier fix).

In any case, we were able to recover our pool by moving the data into a newly created one. The only real issue with this operation was analyzing what happened and figuring out how to recover, which was made harder in part by the failed read-only import on Linux.
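(For anyone else in this situation, the copy-off essentially meant replicating an existing snapshot from the read-only imported pool into a fresh pool; a sketch only, with hypothetical pool, dataset, and snapshot names, and assuming a usable snapshot already exists since new ones cannot be taken on a read-only pool:)

# send a full replication stream of an existing snapshot into the new pool
zfs send -R oldpool/data@last-good | zfs recv -F newpool/data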

@behlendorf

@duidalus The assert triggered when importing the pool read-only on Linux was fixed four months ago; see issue #1332. The fix is in the 0.6.1 tag. Were you using an older version?

@atonkyra

Yeah, we were using a version prior to that.

I did, however, examine the sources for the kernel we used and found that they do include the d9b0ebb fix, so it is doubtful that it was the cause of the initial problem. I guess #1495 could be the source of the problems, but I have no real idea about that.
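(A hypothetical way to check that kind of thing, assuming a git checkout of the zfs repository rather than unpacked kernel sources:)

# exits 0 (and prints the message) only if commit d9b0ebb is an ancestor of the checked-out tree
git merge-base --is-ancestor d9b0ebb HEAD && echo "d9b0ebb is included"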

I'm not sure whether it is at all relevant, but the pool did have some corrupt data from its Solaris days, before we moved it onto zfsonlinux.

@behlendorf

@duidalus Depending on the corruption, it could be responsible for the space_map_add assert above. Hard to say for sure, though.

@FransUrbo

@behlendorf Considering we haven't heard from @atonkyra in a year, should we close this as stale?

@atonkyra

I'm OK with that, since the pool is long gone (and thus impossible to debug any further).

behlendorf modified the milestone: 0.6.5 (Nov 8, 2014)