
ASSERT panic in zpool_upgrade_004_pos.ksh #6141

Closed
tcaputi opened this issue May 18, 2017 · 3 comments
Labels
Component: Test Suite, Status: Stale

Comments


tcaputi commented May 18, 2017

System information

Type Version/Name
Distribution Name
Distribution Version
Linux Kernel
Architecture
ZFS Version
SPL Version

Describe the problem you're observing

zpool_upgrade_004_pos.ksh spuriously triggers an ASSERT on buildbot.

Describe how to reproduce the problem

The problem can occasionally be reproduced by the automated buildbot testing, or, with some luck, by running the zpool_upgrade tests from the ZFS Test Suite in a loop.

Include any warning/errors/backtraces from the system logs

[ 8477.027346] VERIFY3(offset + size <= sm->sm_start + sm->sm_size) failed (6755399466238464 <= 25690112)
[ 8477.032642] PANIC at space_map.c:115:space_map_load()
[ 8477.035452] Showing stack for process 3322
[ 8477.035456] CPU: 1 PID: 3322 Comm: z_wr_iss Tainted: P           OE   4.9.2-kmemleak #1
[ 8477.035457] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
[ 8477.035459]  ffffae1dcf03f958 ffffffff98619156 ffffffffc1052519 0000000000000073
[ 8477.035464]  ffffae1dcf03f968 ffffffffc0c65802 ffffae1dcf03faf0 ffffffffc0c65ad9
[ 8477.035468]  ffff9556dcc07dc0 ffffae1d00000030 ffffae1dcf03fb00 ffffae1dcf03faa0
[ 8477.035472] Call Trace:
[ 8477.035480]  [<ffffffff98619156>] dump_stack+0xb0/0xea
[ 8477.035495]  [<ffffffffc0c65802>] spl_dumpstack+0x62/0x70 [spl]
[ 8477.035503]  [<ffffffffc0c65ad9>] spl_panic+0xd9/0x120 [spl]
[ 8477.035509]  [<ffffffff98c9e387>] ? ftrace_call+0x5/0x34
[ 8477.035518]  [<ffffffffc0c65a05>] ? spl_panic+0x5/0x120 [spl]
[ 8477.035635]  [<ffffffffc0ec7986>] space_map_load+0x3d6/0x7e0 [zfs]
[ 8477.035713]  [<ffffffffc0e9190b>] metaslab_load+0x7b/0x280 [zfs]
[ 8477.035789]  [<ffffffffc0e91e78>] metaslab_activate+0x128/0x1e0 [zfs]
[ 8477.035864]  [<ffffffffc0e95b22>] metaslab_alloc+0x952/0x1f10 [zfs]
[ 8477.035945]  [<ffffffffc0f58b37>] zio_dva_allocate+0x217/0xc70 [zfs]
[ 8477.035949]  [<ffffffff98c9e387>] ? ftrace_call+0x5/0x34
[ 8477.035958]  [<ffffffffc0c6aa42>] ? tsd_hash_search.isra.0+0x52/0xf0 [spl]
[ 8477.036037]  [<ffffffffc0f58925>] ? zio_dva_allocate+0x5/0xc70 [zfs]
[ 8477.036046]  [<ffffffffc0c60290>] ? taskq_member+0x20/0x40 [spl]
[ 8477.036146]  [<ffffffffc0f4fea9>] zio_execute+0x149/0x3f0 [zfs]
[ 8477.036156]  [<ffffffffc0c617ed>] taskq_thread+0x36d/0x760 [spl]
[ 8477.036161]  [<ffffffff9810bc80>] ? try_to_wake_up+0x680/0x680
[ 8477.036169]  [<ffffffffc0c61480>] ? taskq_thread_spawn+0x80/0x80 [spl]
[ 8477.036172]  [<ffffffff980fabd9>] kthread+0x119/0x150
[ 8477.036175]  [<ffffffff980faac0>] ? kthread_flush_work+0x1f0/0x1f0
[ 8477.036177]  [<ffffffff98c9ccf5>] ret_from_fork+0x25/0x30

This refers to this bit of code in space_map_load():

	for (entry = entry_map; entry < entry_map_end; entry++) {
		uint64_t e = *entry;
		uint64_t offset, size;

		if (SM_DEBUG_DECODE(e))		/* Skip debug entries */
			continue;

		offset = (SM_OFFSET_DECODE(e) << sm->sm_shift) +
		    sm->sm_start;
		size = SM_RUN_DECODE(e) << sm->sm_shift;

		VERIFY0(P2PHASE(offset, 1ULL << sm->sm_shift));
		VERIFY0(P2PHASE(size, 1ULL << sm->sm_shift));
		VERIFY3U(offset, >=, sm->sm_start);
		VERIFY3U(offset + size, <=, sm->sm_start + sm->sm_size); /* PANIC here */
		if (SM_TYPE_DECODE(e) == maptype) {
			VERIFY3U(range_tree_space(rt) + size, <=,
			    sm->sm_size);
			range_tree_add(rt, offset, size);
		} else {
			range_tree_remove(rt, offset, size);
		}
	}
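For reference, here is a minimal standalone sketch of what the SM_*_DECODE macros above do, assuming the classic one-word space map entry layout (debug flag in bit 63, 47-bit offset in bits 16-62, 1-bit type in bit 15, 15-bit run in bits 0-14); the entry value, sm_shift, and sm_start below are made-up illustrative numbers, not taken from this panic:

/*
 * Illustrative decode of a legacy one-word space map entry, mirroring
 * the arithmetic in the loop above. Layout assumed: debug flag in bit
 * 63, 47-bit offset in bits 16-62, 1-bit type in bit 15, 15-bit run
 * in bits 0-14. The entry, sm_shift and sm_start values are made up.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t e = 0x0000123400018001ULL;     /* hypothetical on-disk word */
    uint64_t sm_shift = 9;                  /* e.g. 512-byte sectors */
    uint64_t sm_start = 0;                  /* metaslab start, 0 for simplicity */

    uint64_t debug = (e >> 63) & 0x1;                 /* SM_DEBUG_DECODE */
    uint64_t off   = (e >> 16) & ((1ULL << 47) - 1);  /* SM_OFFSET_DECODE */
    uint64_t type  = (e >> 15) & 0x1;                 /* SM_TYPE_DECODE */
    uint64_t run   = e & ((1ULL << 15) - 1);          /* SM_RUN_DECODE (simplified) */

    /* Mirrors the offset/size arithmetic in space_map_load() above. */
    uint64_t offset = (off << sm_shift) + sm_start;
    uint64_t size   = run << sm_shift;

    printf("debug=%llu type=%llu offset=%llu size=%llu\n",
        (unsigned long long)debug, (unsigned long long)type,
        (unsigned long long)offset, (unsigned long long)size);

    /*
     * The offset field is shifted left by sm_shift, so a stray bit N in
     * the raw field would show up as bit N + sm_shift in the decoded
     * offset (e.g. field bit 42 -> offset bit 51 when sm_shift == 9).
     */
    return (0);
}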

I dug into the stack trace a little bit and have some info that may be useful, although I could not identify the root cause. The VERIFY3U is checking that this space map segment does not extend beyond the space map's upper bound. The endpoint of the segment is reported as 6755399466238464, which is far larger than the bound of 25690112.

In binary, 6755399466238464 = 0b11000000000000000000000000001100000000100001000000000; if you clear the top two bits the number becomes 25182720, which looks like a much more reasonable value (it is back under the 25690112 bound). If I'm counting correctly, those are bits 52 and 51. So hopefully debugging this is just a matter of answering the question "what is setting those high bits during a zpool upgrade?"
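To make that arithmetic easy to double-check, here is a small standalone program (plain C, not ZFS code; the two constants are copied from the VERIFY3 message above):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t bad   = 6755399466238464ULL;   /* offset + size from the panic */
    uint64_t bound = 25690112ULL;           /* sm_start + sm_size from the panic */

    /* Keep bits 0..50, i.e. clear the two stray high bits (52 and 51). */
    uint64_t stripped = bad & ((1ULL << 51) - 1);

    printf("bad      = %llu (0x%llx)\n",
        (unsigned long long)bad, (unsigned long long)bad);
    printf("stripped = %llu, bound = %llu, in bounds: %s\n",
        (unsigned long long)stripped, (unsigned long long)bound,
        stripped <= bound ? "yes" : "no");
    /* Prints: stripped = 25182720, bound = 25690112, in bounds: yes */
    return (0);
}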

This could be related to #4034 as well.

behlendorf added the Component: Test Suite label May 18, 2017

tcaputi commented Aug 12, 2019

We actually just hit this issue (or a similar one) in production today:

[   27.991740] VERIFY3(offset + size <= sm->sm_start + sm->sm_size) failed (2364496035840 <= 171798691840)
[   27.991750] PANIC at space_map.c:115:space_map_load()
[   27.991754] Showing stack for process 1356
[   27.991762] CPU: 1 PID: 1356 Comm: z_wr_iss Tainted: P           OE   4.4.0-142-generic #168-Ubuntu
[   27.991766] Hardware name: Datto Inc. DATTO-1000/DATTO-1000, BIOS B234D006 02/19/2014
[   27.991770]  0000000000000286 638b8f34032057f4 ffff88023315b8b0 ffffffff8140a2e1
[   27.991778]  ffffffffc0d7b63a 0000000000000073 ffff88023315b8c0 ffffffffc09db4d2
[   27.991784]  ffff88023315ba48 ffffffffc09db5a5 ffff880236801b00 0000000000000030
[   27.991790] Call Trace:
[   27.991805]  [<ffffffff8140a2e1>] dump_stack+0x63/0x82
[   27.991827]  [<ffffffffc09db4d2>] spl_dumpstack+0x42/0x50 [spl]
[   27.991842]  [<ffffffffc09db5a5>] spl_panic+0xc5/0x100 [spl]
[   27.991853]  [<ffffffffc07fe723>] ? avl_add+0x73/0xa0 [zavl]
[   27.991967]  [<ffffffffc0c6519b>] ? metaslab_rt_add+0x2b/0x60 [zfs]
[   27.992058]  [<ffffffffc0c6b859>] ? range_tree_add+0x139/0x2e0 [zfs]
[   27.992154]  [<ffffffffc0c8768b>] space_map_load+0x35b/0x550 [zfs]
[   27.992244]  [<ffffffffc0c6673a>] metaslab_load+0x3a/0x100 [zfs]
[   27.992331]  [<ffffffffc0c66970>] metaslab_activate+0x90/0xb0 [zfs]
[   27.992420]  [<ffffffffc0c68625>] metaslab_alloc+0x5b5/0xf80 [zfs]
[   27.992520]  [<ffffffffc0cdf526>] zio_dva_allocate+0xb6/0x5d0 [zfs]
[   27.992529]  [<ffffffff8185edc2>] ? mutex_lock+0x12/0x30
[   27.992617]  [<ffffffffc0c67fee>] ? metaslab_class_throttle_reserve+0xae/0xc0 [zfs]
[   27.992633]  [<ffffffffc09de2b0>] ? tsd_hash_search.isra.0+0x70/0x90 [spl]
[   27.992648]  [<ffffffffc09de34e>] ? tsd_get_by_thread+0x2e/0x40 [spl]
[   27.992661]  [<ffffffffc09d8928>] ? taskq_member+0x18/0x30 [spl]
[   27.992761]  [<ffffffffc0cdc0a0>] zio_execute+0x90/0xf0 [zfs]
[   27.992774]  [<ffffffffc09d9734>] taskq_thread+0x2b4/0x4b0 [spl]
[   27.992780]  [<ffffffff8185c601>] ? __schedule+0x301/0x810
[   27.992788]  [<ffffffff810b1f10>] ? wake_up_q+0x70/0x70
[   27.992888]  [<ffffffffc0cdc010>] ? zio_reexecute+0x390/0x390 [zfs]
[   27.992901]  [<ffffffffc09d9480>] ? task_done+0xb0/0xb0 [spl]
[   27.992908]  [<ffffffff810a6227>] kthread+0xe7/0x100
[   27.992913]  [<ffffffff8185c601>] ? __schedule+0x301/0x810
[   27.992918]  [<ffffffff810a6140>] ? kthread_create_on_node+0x1e0/0x1e0
[   27.992923]  [<ffffffff818618d2>] ret_from_fork+0x42/0x80
[   27.992928]  [<ffffffff810a6140>] ? kthread_create_on_node+0x1e0/0x1e0

It looks like this is also the same issue as the (now closed) #5803. I think at this point it's safe to say this probably isn't an ECC problem, but a subtle race somewhere in the code.
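For what it's worth, the numbers in this panic fit the same pattern as the 2017 one: masking off everything above the bound's top bit leaves a value that is back inside the space map. A quick standalone check (plain arithmetic, not ZFS code):

#include <stdio.h>
#include <stdint.h>

/* Keep only the bits of 'bad' at or below the bound's most significant bit. */
static void
check(uint64_t bad, uint64_t bound)
{
    int msb = 63;

    while (msb > 0 && (bound & (1ULL << msb)) == 0)
        msb--;

    uint64_t stripped = bad & ((1ULL << (msb + 1)) - 1);

    printf("bad %llu -> stripped %llu (bound %llu, %s)\n",
        (unsigned long long)bad, (unsigned long long)stripped,
        (unsigned long long)bound,
        stripped <= bound ? "in bounds" : "still out of bounds");
}

int
main(void)
{
    check(6755399466238464ULL, 25690112ULL);    /* 2017 buildbot panic */
    check(2364496035840ULL, 171798691840ULL);   /* this production panic */
    return (0);
}

/*
 * Output:
 * bad 6755399466238464 -> stripped 25182720 (bound 25690112, in bounds)
 * bad 2364496035840 -> stripped 165472780288 (bound 171798691840, in bounds)
 */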


ikozhukhov commented Aug 12, 2019

I hit something similar on DilOS:
https://cl.ly/8a02252a8703
https://cl.ly/eba774012f20
These are panics after upgrading an old pool (less than one year old, created with the "Features in 2.0.1x" feature set) to the latest release with the new features ("Features in 2.0.2x"):
https://dilos-dev.atlassian.net/wiki/spaces/DS/pages/1116766221/zpool+compatibility+features

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label Aug 24, 2020
stale bot closed this as completed Nov 26, 2020