
ASSERT panic in zpool_upgrade_004_pos.ksh #6141

Closed
tcaputi opened this issue May 18, 2017 · 3 comments
Labels
Component: Test Suite, Status: Stale

Comments


tcaputi commented May 18, 2017

System information

Type Version/Name
Distribution Name
Distribution Version
Linux Kernel
Architecture
ZFS Version
SPL Version

Describe the problem you're observing

zpool_upgrade_004_pos.ksh spuriously triggers an ASSERT on buildbot.

Describe how to reproduce the problem

The problem can occasionally be reproduced by the automated buildbot testing, or, with some luck, by running the zpool_upgrade tests from the ZFS Test Suite in a loop.

Include any warning/errors/backtraces from the system logs

[ 8477.027346] VERIFY3(offset + size <= sm->sm_start + sm->sm_size) failed (6755399466238464 <= 25690112)
[ 8477.032642] PANIC at space_map.c:115:space_map_load()
[ 8477.035452] Showing stack for process 3322
[ 8477.035456] CPU: 1 PID: 3322 Comm: z_wr_iss Tainted: P           OE   4.9.2-kmemleak #1
[ 8477.035457] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
[ 8477.035459]  ffffae1dcf03f958 ffffffff98619156 ffffffffc1052519 0000000000000073
[ 8477.035464]  ffffae1dcf03f968 ffffffffc0c65802 ffffae1dcf03faf0 ffffffffc0c65ad9
[ 8477.035468]  ffff9556dcc07dc0 ffffae1d00000030 ffffae1dcf03fb00 ffffae1dcf03faa0
[ 8477.035472] Call Trace:
[ 8477.035480]  [<ffffffff98619156>] dump_stack+0xb0/0xea
[ 8477.035495]  [<ffffffffc0c65802>] spl_dumpstack+0x62/0x70 [spl]
[ 8477.035503]  [<ffffffffc0c65ad9>] spl_panic+0xd9/0x120 [spl]
[ 8477.035509]  [<ffffffff98c9e387>] ? ftrace_call+0x5/0x34
[ 8477.035518]  [<ffffffffc0c65a05>] ? spl_panic+0x5/0x120 [spl]
[ 8477.035635]  [<ffffffffc0ec7986>] space_map_load+0x3d6/0x7e0 [zfs]
[ 8477.035713]  [<ffffffffc0e9190b>] metaslab_load+0x7b/0x280 [zfs]
[ 8477.035789]  [<ffffffffc0e91e78>] metaslab_activate+0x128/0x1e0 [zfs]
[ 8477.035864]  [<ffffffffc0e95b22>] metaslab_alloc+0x952/0x1f10 [zfs]
[ 8477.035945]  [<ffffffffc0f58b37>] zio_dva_allocate+0x217/0xc70 [zfs]
[ 8477.035949]  [<ffffffff98c9e387>] ? ftrace_call+0x5/0x34
[ 8477.035958]  [<ffffffffc0c6aa42>] ? tsd_hash_search.isra.0+0x52/0xf0 [spl]
[ 8477.036037]  [<ffffffffc0f58925>] ? zio_dva_allocate+0x5/0xc70 [zfs]
[ 8477.036046]  [<ffffffffc0c60290>] ? taskq_member+0x20/0x40 [spl]
[ 8477.036146]  [<ffffffffc0f4fea9>] zio_execute+0x149/0x3f0 [zfs]
[ 8477.036156]  [<ffffffffc0c617ed>] taskq_thread+0x36d/0x760 [spl]
[ 8477.036161]  [<ffffffff9810bc80>] ? try_to_wake_up+0x680/0x680
[ 8477.036169]  [<ffffffffc0c61480>] ? taskq_thread_spawn+0x80/0x80 [spl]
[ 8477.036172]  [<ffffffff980fabd9>] kthread+0x119/0x150
[ 8477.036175]  [<ffffffff980faac0>] ? kthread_flush_work+0x1f0/0x1f0
[ 8477.036177]  [<ffffffff98c9ccf5>] ret_from_fork+0x25/0x30

This refers to this bit of code in space_map_load():

	for (entry = entry_map; entry < entry_map_end; entry++) {
		uint64_t e = *entry;
		uint64_t offset, size;

		if (SM_DEBUG_DECODE(e))		/* Skip debug entries */
			continue;

		offset = (SM_OFFSET_DECODE(e) << sm->sm_shift) +
		    sm->sm_start;
		size = SM_RUN_DECODE(e) << sm->sm_shift;

		VERIFY0(P2PHASE(offset, 1ULL << sm->sm_shift));
		VERIFY0(P2PHASE(size, 1ULL << sm->sm_shift));
		VERIFY3U(offset, >=, sm->sm_start);
		VERIFY3U(offset + size, <=, sm->sm_start + sm->sm_size); /* PANIC here */
		if (SM_TYPE_DECODE(e) == maptype) {
			VERIFY3U(range_tree_space(rt) + size, <=,
			    sm->sm_size);
			range_tree_add(rt, offset, size);
		} else {
			range_tree_remove(rt, offset, size);
		}
	}
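For reference, here is a minimal standalone sketch of what the SM_*_DECODE macros above do, assuming the classic one-word space map entry layout (debug flag in bit 63, 47-bit offset in bits 16-62, 1-bit type in bit 15, 15-bit run in bits 0-14); the entry value, sm_shift, and sm_start below are made-up illustrative numbers, not taken from this panic:

/*
 * Illustrative decode of a legacy one-word space map entry, mirroring
 * the arithmetic in the loop above. Layout assumed: debug flag in bit
 * 63, 47-bit offset in bits 16-62, 1-bit type in bit 15, 15-bit run
 * in bits 0-14. The entry, sm_shift and sm_start values are made up.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t e = 0x0000123400018001ULL;     /* hypothetical on-disk word */
    uint64_t sm_shift = 9;                  /* e.g. 512-byte sectors */
    uint64_t sm_start = 0;                  /* metaslab start, 0 for simplicity */

    uint64_t debug = (e >> 63) & 0x1;                 /* SM_DEBUG_DECODE */
    uint64_t off   = (e >> 16) & ((1ULL << 47) - 1);  /* SM_OFFSET_DECODE */
    uint64_t type  = (e >> 15) & 0x1;                 /* SM_TYPE_DECODE */
    uint64_t run   = e & ((1ULL << 15) - 1);          /* SM_RUN_DECODE (simplified) */

    /* Mirrors the offset/size arithmetic in space_map_load() above. */
    uint64_t offset = (off << sm_shift) + sm_start;
    uint64_t size   = run << sm_shift;

    printf("debug=%llu type=%llu offset=%llu size=%llu\n",
        (unsigned long long)debug, (unsigned long long)type,
        (unsigned long long)offset, (unsigned long long)size);

    /*
     * The offset field is shifted left by sm_shift, so a stray bit N in
     * the raw field would show up as bit N + sm_shift in the decoded
     * offset (e.g. field bit 42 -> offset bit 51 when sm_shift == 9).
     */
    return (0);
}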

I dug into the stack trace a little bit and have some info that may be useful, although I could not identify the root cause. The VERIFY3U is checking that this space map segment does not extend beyond the space map's upper bound. The endpoint of the segment is reported as 6755399466238464, which is far larger than the bound of 25690112.

In binary, 6755399466238464 = 0b11000000000000000000000000001100000000100001000000000; if you clear the top two bits the number becomes 25182720, which looks like a much more reasonable value (it is back under the 25690112 bound). If I'm counting correctly, those are bits 52 and 51. So hopefully debugging this is just a matter of answering the question "what is setting those high bits during a zpool upgrade?"
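To make that arithmetic easy to double-check, here is a small standalone program (plain C, not ZFS code; the two constants are copied from the VERIFY3 message above):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t bad   = 6755399466238464ULL;   /* offset + size from the panic */
    uint64_t bound = 25690112ULL;           /* sm_start + sm_size from the panic */

    /* Keep bits 0..50, i.e. clear the two stray high bits (52 and 51). */
    uint64_t stripped = bad & ((1ULL << 51) - 1);

    printf("bad      = %llu (0x%llx)\n",
        (unsigned long long)bad, (unsigned long long)bad);
    printf("stripped = %llu, bound = %llu, in bounds: %s\n",
        (unsigned long long)stripped, (unsigned long long)bound,
        stripped <= bound ? "yes" : "no");
    /* Prints: stripped = 25182720, bound = 25690112, in bounds: yes */
    return (0);
}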

This could be related to #4034 as well.

behlendorf added the Component: Test Suite label May 18, 2017

tcaputi commented Aug 12, 2019

We actually just hit this issue (or a similar one) in production today:

[   27.991740] VERIFY3(offset + size <= sm->sm_start + sm->sm_size) failed (2364496035840 <= 171798691840)
[   27.991750] PANIC at space_map.c:115:space_map_load()
[   27.991754] Showing stack for process 1356
[   27.991762] CPU: 1 PID: 1356 Comm: z_wr_iss Tainted: P           OE   4.4.0-142-generic #168-Ubuntu
[   27.991766] Hardware name: Datto Inc. DATTO-1000/DATTO-1000, BIOS B234D006 02/19/2014
[   27.991770]  0000000000000286 638b8f34032057f4 ffff88023315b8b0 ffffffff8140a2e1
[   27.991778]  ffffffffc0d7b63a 0000000000000073 ffff88023315b8c0 ffffffffc09db4d2
[   27.991784]  ffff88023315ba48 ffffffffc09db5a5 ffff880236801b00 0000000000000030
[   27.991790] Call Trace:
[   27.991805]  [<ffffffff8140a2e1>] dump_stack+0x63/0x82
[   27.991827]  [<ffffffffc09db4d2>] spl_dumpstack+0x42/0x50 [spl]
[   27.991842]  [<ffffffffc09db5a5>] spl_panic+0xc5/0x100 [spl]
[   27.991853]  [<ffffffffc07fe723>] ? avl_add+0x73/0xa0 [zavl]
[   27.991967]  [<ffffffffc0c6519b>] ? metaslab_rt_add+0x2b/0x60 [zfs]
[   27.992058]  [<ffffffffc0c6b859>] ? range_tree_add+0x139/0x2e0 [zfs]
[   27.992154]  [<ffffffffc0c8768b>] space_map_load+0x35b/0x550 [zfs]
[   27.992244]  [<ffffffffc0c6673a>] metaslab_load+0x3a/0x100 [zfs]
[   27.992331]  [<ffffffffc0c66970>] metaslab_activate+0x90/0xb0 [zfs]
[   27.992420]  [<ffffffffc0c68625>] metaslab_alloc+0x5b5/0xf80 [zfs]
[   27.992520]  [<ffffffffc0cdf526>] zio_dva_allocate+0xb6/0x5d0 [zfs]
[   27.992529]  [<ffffffff8185edc2>] ? mutex_lock+0x12/0x30
[   27.992617]  [<ffffffffc0c67fee>] ? metaslab_class_throttle_reserve+0xae/0xc0 [zfs]
[   27.992633]  [<ffffffffc09de2b0>] ? tsd_hash_search.isra.0+0x70/0x90 [spl]
[   27.992648]  [<ffffffffc09de34e>] ? tsd_get_by_thread+0x2e/0x40 [spl]
[   27.992661]  [<ffffffffc09d8928>] ? taskq_member+0x18/0x30 [spl]
[   27.992761]  [<ffffffffc0cdc0a0>] zio_execute+0x90/0xf0 [zfs]
[   27.992774]  [<ffffffffc09d9734>] taskq_thread+0x2b4/0x4b0 [spl]
[   27.992780]  [<ffffffff8185c601>] ? __schedule+0x301/0x810
[   27.992788]  [<ffffffff810b1f10>] ? wake_up_q+0x70/0x70
[   27.992888]  [<ffffffffc0cdc010>] ? zio_reexecute+0x390/0x390 [zfs]
[   27.992901]  [<ffffffffc09d9480>] ? task_done+0xb0/0xb0 [spl]
[   27.992908]  [<ffffffff810a6227>] kthread+0xe7/0x100
[   27.992913]  [<ffffffff8185c601>] ? __schedule+0x301/0x810
[   27.992918]  [<ffffffff810a6140>] ? kthread_create_on_node+0x1e0/0x1e0
[   27.992923]  [<ffffffff818618d2>] ret_from_fork+0x42/0x80
[   27.992928]  [<ffffffff810a6140>] ? kthread_create_on_node+0x1e0/0x1e0

It looks like this is also the same issue as the (now closed) #5803. I think at this point it's safe to say this probably isn't an ECC problem, but a subtle race somewhere in the code.
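For what it's worth, the numbers in this panic fit the same pattern as the 2017 one: masking off everything above the bound's top bit leaves a value that is back inside the space map. A quick standalone check (plain arithmetic, not ZFS code):

#include <stdio.h>
#include <stdint.h>

/* Keep only the bits of 'bad' at or below the bound's most significant bit. */
static void
check(uint64_t bad, uint64_t bound)
{
    int msb = 63;

    while (msb > 0 && (bound & (1ULL << msb)) == 0)
        msb--;

    uint64_t stripped = bad & ((1ULL << (msb + 1)) - 1);

    printf("bad %llu -> stripped %llu (bound %llu, %s)\n",
        (unsigned long long)bad, (unsigned long long)stripped,
        (unsigned long long)bound,
        stripped <= bound ? "in bounds" : "still out of bounds");
}

int
main(void)
{
    check(6755399466238464ULL, 25690112ULL);    /* 2017 buildbot panic */
    check(2364496035840ULL, 171798691840ULL);   /* this production panic */
    return (0);
}

/*
 * Output:
 * bad 6755399466238464 -> stripped 25182720 (bound 25690112, in bounds)
 * bad 2364496035840 -> stripped 165472780288 (bound 171798691840, in bounds)
 */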


ikozhukhov commented Aug 12, 2019

I hit something similar on DilOS:
https://cl.ly/8a02252a8703
https://cl.ly/eba774012f20
These are panics after upgrading an old pool (less than one year old, created with the "Features in 2.0.1x" feature set) to the latest release with the new features ("Features in 2.0.2x"):
https://dilos-dev.atlassian.net/wiki/spaces/DS/pages/1116766221/zpool+compatibility+features

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label Aug 24, 2020
stale bot closed this as completed Nov 26, 2020