zfs-2.1.7 zfs recv panic in dsl_dataset_deactivate_feature_impl #14252
ZFS is trying to deactivate a feature that is not active, and it very much does not like that. Do any variations of the send command without |
The error message mentions the large_blocks feature, although it is only enabled, not active, on both the sender and receiver. The -L flag is probably responsible for triggering this. Could it be related to #13782 ? I will not be able to try anything until tomorrow. The machine is at the office and is currently experiencing this issue, so I am unable to ssh in. |
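(For anyone following along: the enabled-vs-active state of the feature can be checked on each side with standard zpool syntax. This is a generic sketch; the pool names are placeholders, not the reporter's:)
zpool get feature@large_blocks sendpool
zpool get feature@large_blocks recvpool
# "enabled" means the feature is available but has never been used;
# "active" means at least one dataset has recorded blocks larger than 128k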
I can confirm that this also happens to me: zfs core dumps during send/receive. It happens here if the sender has recordsize 128k and the receiver has recordsize 1M:
|
I tried various permutations of zfs send flags and here are the results:
I use zstd-1 compression on all my datasets. Compression is inherited from the top dataset. |
Does |
I have left the office now and won't be there for a while. Most of us work from home. |
Just got the same panic using only zfs send -R -I snapshot_name, so trying send -RLe will not shed any light on the issue.
|
I was afraid that might happen. Thanks for confirming it. |
On Sunday, 2022-12-04 at 05:38 -0800, George Amanakis wrote:
As far as I can see #13782 is not included in 2.1.7. Is it possible
to retry with current master or cherry-pick that commit on 2.1.7 and
try again?
This patch is already part of zfs 2.1.7.
See the bottom of this page:
zfs-2.1.6...zfs-2.1.7
|
Is this reproducible on 2.1.6? |
No, I've been running 2.1.6 with zfs send -RLec since its release. The script has been running unchanged for about 5 years and I have never experienced this issue. It happened instantly once I updated to zfs 2.1.7. |
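(For context, a replication of that shape boils down to a pipeline roughly like the one below. Hostnames, dataset names and snapshot names are placeholders, not the poster's actual script:)
# -R replicate the dataset tree and its snapshots, -L allow blocks larger than 128k in the stream,
# -e use embedded block records where possible, -c send blocks compressed as stored on disk,
# -I send all intermediate snapshots between the two named ones
zfs send -RLec -I tank/data@yesterday tank/data@today | ssh backuphost zfs receive -s -F backup/tank/data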
Could you try reverting c8d2ab0 to see if that makes it stop? I do not see anything else that changed in 2.1.7 that could possibly be related to this. |
Thanks, I will try reverting c8d2ab0. It might be a couple of days before I can try it. |
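(For reference, a rough sketch of what that revert-and-rebuild looks like when building from the OpenZFS source tree; exact steps vary per distro, and the OpenZFS build documentation is authoritative:)
git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.1.7
git revert c8d2ab0                              # revert the suspected commit
sh autogen.sh && ./configure && make -j$(nproc)
sudo make install                               # or build native packages for your distro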
I've reverted c8d2ab0 and tried the zfs send again. Unfortunately my pool seems to be in a weird state and fails with lots of errors about snapshots not existing. After the failure, on the recv end the receive dataset has been renamed to xxxrecv-xxxxx-1. I tried renaming the dataset back to raid3/raid2, but each time I did the zfs send it was renamed to raid3/raid2recv-xxxxx-1 again. I am running a pool scrub and will try again once it has finished. The sending pool is raid2 and it is received into raid3/raid2 on the receiving end. All the snapshots mentioned in the errors below exist on the receiving end; they've just been renamed to raid3/raid2recv-65018-1. Here is an extract of some of the errors.
|
I did a full test again, but reverting the patch did not help. At least, from what I found, reverting this patch prevents a complete freeze of the pool, but I still see coredumps during send/receive. This is the patch I am reverting:
With stock zfs 2.1.7 (incl. this patch) my pool completely freezes when doing send / receive:
gives the following in the log:
I see multiple of those blocked task messages for various tasks. The pool is then frozen. Reboot takes ages and I had to force a reboot with REISUB. When I reverted the patch I did not experience freezes, but I got the same error and coredumps. Unfortunately I cannot provide a coredump because my system says: Resource limits disable core dumping for process 82335 (zfs). Also of interest: this error/coredump does not happen every time. Approximately 1 out of 4 send/receive processes executes fine. And just to reassure you: this is a new bug in 2.1.7. I have been using this backup process with syncoid for many years without any problem. Reverting back to zfs 2.1.6 gives me a stable system again. |
Could you do a git bisect with good being 2.1.6 and bad 2.1.7?
…On Fri, Dec 9, 2022, 11:51 AM mabod ***@***.***> wrote:
I did a full test again, but reverting the patch did not help. At least,
what I found, reverting this patch prevents a complete freeze of the pool.
But I still see coredumps during send/receive
This is the patch I am reverting:
From 1af2632 Mon Sep 17 00:00:00 2001
From: George Amanakis ***@***.***>
Date: Tue, 30 Aug 2022 22:15:56 +0200
Subject: [PATCH] Fix setting the large_block feature after receiving a
snapshot
We are not allowed to dirty a filesystem when done receiving
a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
not be set on that filesystem since the filesystem is not on
dp_dirty_datasets, and a subsequent encrypted raw send will fail.
Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature
needs to be activated and do so if appropriate.
Signed-off-by: George Amanakis ***@***.***>
---
module/zfs/dsl_dataset.c | 15 ++++
tests/runfiles/common.run | 2 +-
tests/zfs-tests/tests/Makefile.am | 1 +
.../rsend/send_raw_large_blocks.ksh | 78 +++++++++++++++++++
4 files changed, 95 insertions(+), 1 deletion(-)
create mode 100755 tests/zfs-tests/tests/functional/rsend/send_raw_large_blocks.ksh
diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index c7577fc584a..4da4effca60 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1760,6 +1760,21 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
}
}
+ /*
+ * We are not allowed to dirty a filesystem when done receiving
+ * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
+ * not be set and a subsequent encrypted raw send will fail. Hence
+ * activate this feature if needed here.
+ */
+ for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+ if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+ !(zfeature_active(f, ds->ds_feature[f]))) {
+ dsl_dataset_activate_feature(dsobj, f,
+ ds->ds_feature_activation[f], tx);
+ ds->ds_feature[f] = ds->ds_feature_activation[f];
+ }
+ }
+
ASSERT3U(ds->ds_prev != 0, ==,
dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
if (ds->ds_prev) {
With stock zfs 2.1.7 (incl. this patch) my pool completely freezes when
doing send / receive:
syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
NEWEST SNAPSHOT: 2022-12-09--10:47
Sending incremental ***@***.***:32 ... 2022-12-09--10:47 (~ 4 KB):
0,00 B 0:00:00 [0,00 B/s] [> ] 0%
cannot receive: failed to read from stream
CRITICAL ERROR: zfs send -L -I 'zHome/home'@'2022-12-09--10:32' 'zHome/home'@'2022-12-09--10:47' | mbuffer -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 4720 | zfs receive -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.
gives the following in the log:
Dez 09 10:35:47 rakete kernel: INFO: task txg_sync:3044 blocked for more than 122 seconds.
Dez 09 10:35:47 rakete kernel: Tainted: P OE 6.0.11-zen1-1-zen #1
Dez 09 10:35:47 rakete kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dez 09 10:35:47 rakete kernel: task:txg_sync state:D stack: 0 pid: 3044 ppid: 2 flags:0x00004000
Dez 09 10:35:47 rakete kernel: Call Trace:
Dez 09 10:35:47 rakete kernel: <TASK>
Dez 09 10:35:47 rakete kernel: __schedule+0xb43/0x1350
Dez 09 10:35:47 rakete kernel: schedule+0x5e/0xd0
Dez 09 10:35:47 rakete kernel: spl_panic+0x10a/0x10c [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_dataset_clone_swap_sync_impl+0x90b/0xe30 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_dataset_rollback_sync+0x109/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_sync_task_sync+0xac/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_pool_sync+0x40d/0x5c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: spa_sync+0x56c/0xf90 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? spa_txg_history_init_io+0x193/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: txg_sync_thread+0x22b/0x3f0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? txg_wait_open+0xf0/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? __thread_exit+0x20/0x20 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: thread_generic_wrapper+0x5e/0x70 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: kthread+0xde/0x110
Dez 09 10:35:47 rakete kernel: ? kthread_complete_and_exit+0x20/0x20
Dez 09 10:35:47 rakete kernel: ret_from_fork+0x22/0x30
Dez 09 10:35:47 rakete kernel: </TASK>
I see multiple of those blocked task messages for various tasks. The pool
is then frozen. Reboot takes ages and I had to force reboot with REISUB.
When I reverted the patch I did not experience freezes but the same error
and coredumps. Unfortunately I cannot provide a coredump because my system
says: Resource limits disable core dumping for process 82335 (zfs). And I
do not know why that is.
Also of interest: this error/coredump does not happen every time.
Approximately 1 out of 4 send/receive processes executes fine.
And just to reassure you: this is a new bug in 2.1.7. I have been using this
backup process with syncoid for many years without any problem. Reverting
back to zfs 2.1.6 gives me a stable system again.
|
I could do that, but I need some help. I am not a programmer or a git expert, so I need guidance on this. I have a couple of questions to start with:
|
The straightforward way would be to compile manually:
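(The numbered steps did not survive in this copy of the thread; roughly, a manual bisect-and-build cycle looks like the sketch below, assuming the standard autotools build from the OpenZFS source tree:)
git clone https://github.com/openzfs/zfs.git && cd zfs
git bisect start
git bisect good zfs-2.1.6                       # known-good release tag
git bisect bad zfs-2.1.7                        # known-bad release tag
# then, for each commit git bisect checks out:
sh autogen.sh && ./configure && make -j$(nproc)
# reload the freshly built modules (see below), try to reproduce the send/receive,
# and tell git the verdict with: git bisect good   or   git bisect bad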
|
step 5 does not work for me. I have the module running and I have active pools. How do I reload the zfs module in a running live environment? |
./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools. |
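(A sketch of that sequence; pool names are placeholders, and the zfs.sh helper lives in the source tree you just built:)
sudo zpool export tank                          # export every imported pool first
sudo ./scripts/zfs.sh -vu                       # unload the running spl/zfs modules
sudo ./scripts/zfs.sh -v                        # load the freshly built in-tree modules
sudo zpool import tank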
I have done some comprehensive testing after reverting c8d2ab0 and can report that I am no longer getting the zfs recv panic in dsl_dataset_deactivate_feature_impl. I tried lots of incremental sends using the full zfs send -RLec, and also a full send. That definitely fixed it. @mabod your coredump issue is probably related to another commit. |
@peterska Would you be willing to try a possible fix? |
@gamanakis Let me review the patch and if it looks ok to me I will test it. Will not be able to test it until Monday Sydney time. |
I exported all pools. I rebooted. I verified with the 'zpool status' command that no pools are imported. But when I execute "./scripts/zfs.sh -vu" it fails, saying "module is still in use". Why is that? |
Yes, must be. I checked with the reverted patch again. The coredump does exist, no doubt. |
@peterska I am thinking that in your case the code inserted in
The patch I was thinking about is the following:
This runs the code inserted in |
@gamanakis the patch looks ok to me. I will try it later today and let you know the results. To help you reproduce the panic, here are the non-default settings of my pool and dataset:
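(The actual values did not survive in this copy of the thread. For anyone trying to reproduce, one way to collect such non-default settings; pool and dataset names are placeholders:)
zpool get all tank | grep -v default            # pool properties changed from their defaults
zfs get -s local,received all tank/dataset      # dataset properties set locally or via receive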
|
@gamanakis I finally got a chance to test out your patch today. I could not apply it directly (I guess github changed some spaces/tabs), so I patched it manually. The bottom of this comment shows the surrounding code so I can convince you I applied it correctly. I ran about 8 zfs send -RLec commands and there were no lockups/panics. I have left the new code running on my server so we can detect any issues as it goes about its daily backup tasks.
|
I don’t think it’s encryption related. I got a crash when receiving into 2.1.7 from 0.8.x. Neither side has any encryption enabled. The sender has lz4 whereas the receiver is zstd-19. I’ll try reverting the receiver to lz4 when I get a chance. Of note, the first full send (using syncoid) worked without issue. It was the next, incremental send which crashed my system. |
I could certainly believe it's not encryption related; I've just seen the quota feature and encryption interact poorly before. My guess would be that it's a timing issue where something doesn't actually block on something else happening first, and if you use encryption or high compression with large blocks or the like, one task takes long enough that the other hits this. |
Quickly, I added two log lines in the actual deactivate_impl, and got:
So it seems it tries to deactivate, if I can count, |
I think it has to do with the code here:
If I remove that then it doesn't panic. |
The above change to dmu_objset.c works. It fixes my syncoid backup, which sends from a zfs 2.0 pool to an encrypted zfs 2.1 pool. feature@userobj_accounting and feature@project_quota are activated in both pools. |
@gamanakis : |
Is there a workaround that doesn’t require a code change? Maybe a flag I can pass into the send or receive command to avoid the code path that causes panic? I’m in a situation where one server is offsite and unmanaged and if I happen to break it by injecting my own zfs modules then recovery will be tricky. |
When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14304 Closes #14252
I just got this line (the only one I could read off the screen) and needed to reboot:
with zfs 2.1.9:
|
@phreaker0 can you provide additional information? Were you sending/receiving when this happened? |
@gamanakis I wasn't around when it happened; here is what I did before I left:
But I guess the crash happened pretty soon, because my influxdb logging didn't record anything anymore. The ssd pool is used as ROOT, so I couldn't log in anymore after the panic to gather more information; I only had the screen output. I'm trying to reproduce it now, but so far it works. I will keep an active ssh session with the kernel log open in case of a crash and will report back. |
FYI: I can't reproduce it anymore, it ran for several days. |
Issue: openzfs/zfs#14252 Pull request: openzfs/zfs#14304
I've just had this happen on RHEL 9 with 2.1.9, on a pool composed of LUKS-encrypted devices (which I'm using because of the send/receive issues with native encryption #11679 - aargh!).
VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
With /var/log/messages reporting:
Feb 17 08:51:32 fs6 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
As this is part of an HA pair, the host was fenced shortly afterwards and service resumed on the peer host, where the pool imported with no errors. Pushing the service back to the original host and restarting the transfer has so far not resulted in a repetition. Load on the system is minimal during the transfer, ca. 1.3. |
Same here on zfs 2.1.9 when doing a rollback on an unmounted dataset:
Linux duranux2 6.0.19 #1 SMP PREEMPT_DYNAMIC Thu Feb 9 01:12:28 CET 2023 x86_64 GNU/Linux
févr. 18 20:07:32 duranux2 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2) |
@scratchings @duramuss There is a more suitable fix, see 34ce4c4 and #14502. Hopefully the upcoming 2.1.10 will include them. |
Same here, after 30 days of uptime. No ZFS commands issued during the last 2 weeks, just normal ops.
Will rebuild from |
@slavanap this should be fixed in current |
I patched 2.1.9 using 34ce4c4 and #14502 and deployed this yesterday. I'm sorry to say I'm still getting panics, two today:
Message from syslogd@fs7 at Mar 9 16:24:12 ...
Message from syslogd@fs7 at Mar 9 16:24:12 ... |
To everyone interested: give zfs-2.1.10-staging a try. It already contains all the fixes. |
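(For anyone who wants to build it, the staging branch can be fetched directly; this is just the checkout step, then build and load as discussed earlier in the thread:)
git clone -b zfs-2.1.10-staging https://github.com/openzfs/zfs.git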
When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes openzfs#14304 Closes openzfs#14252 (cherry picked from commit eee9362)
System information
Describe the problem you're observing
When using zfs send to make a backup on a remote machine, the receiver throws a PANIC in one of the zfs functions and the file system deadlocks.
Describe how to reproduce the problem
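(The original reproduction steps did not survive in this copy of the issue. Based on the configurations reported in this thread, a reproduction attempt would look roughly like the sketch below, using throwaway file-backed pools. This is not a confirmed reproducer, and on affected builds the panic was intermittent, roughly 1 run in 4:)
truncate -s 4G /var/tmp/src.img /var/tmp/dst.img
zpool create src /var/tmp/src.img
zpool create dst /var/tmp/dst.img
zfs create -o recordsize=128k -o compression=zstd-1 src/data
zfs create -o recordsize=1M -o compression=zstd-19 dst/backup
dd if=/dev/urandom of=/src/data/file1 bs=1M count=256
zfs snapshot -r src/data@snap1
zfs send -RLec src/data@snap1 | zfs receive -s -F dst/backup/data     # full send reportedly works
dd if=/dev/urandom of=/src/data/file2 bs=1M count=256
zfs snapshot -r src/data@snap2
zfs send -RLec -I src/data@snap1 src/data@snap2 | zfs receive -s -F dst/backup/data
# on affected 2.1.7 builds, the receive side could panic in dsl_dataset_deactivate_feature_impl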
Include any warning/errors/backtraces from the system logs