panic on zfs receive #10698
Comments
It looks like you are doing a "blow away" receive, where we are receiving a full (non-incremental) stream on top of a filesystem that already exists (dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch). In this case, we receive into a temporary dataset (whose name begins with '%'). Any chance you can provide a crash dump?
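(Not something specified in the thread, but for anyone who needs to capture a crash dump on Ubuntu, here is a rough sketch assuming the stock linux-crashdump/kdump-tools packages:)

```sh
# Install and enable kdump so a kernel crash dump is captured under /var/crash/.
sudo apt install linux-crashdump
sudo kdump-config show            # confirm a crashkernel region is reserved (a reboot may be needed)

# A ZFS PANIC may only hang the affected threads rather than panic the kernel;
# once kdump is active, a dump can be forced from another terminal with SysRq:
echo c | sudo tee /proc/sysrq-trigger
```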
If you haven't configured the system to reboot when ZFS panics, you might be able to get the output of the internal debug log (/proc/spl/kstat/zfs/dbgmsg).
This generally must first be enabled by toggling debug flags (on Linux) before reproducing the issue; they might also want to bump the max size of the log. @ahrens, can you suggest some debug flags they might want to enable before re-testing, to save time?
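For reference, a minimal sketch of how that can be done on Linux; the module parameters below are standard OpenZFS tunables rather than anything called out in this thread, and the values are only suggestions:

```sh
# Run as root. Enable the internal ZFS debug log and enlarge it before
# reproducing the panic.
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable          # collect dbgmsg entries
echo 134217728 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize # allow up to 128 MiB of log

# Optionally also log dprintf() messages (bit 0 of the zfs_flags bitmask).
echo 1 > /sys/module/zfs/parameters/zfs_flags

# After reproducing the issue, capture the log:
cat /proc/spl/kstat/zfs/dbgmsg > /tmp/zfs-dbgmsg.txt
```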
I noticed that in the NOOP send, there's the line:
The source dataset has that value:
As does the destination:
I set that value a while back on the source dataset to the maximum value I could, so maybe there's an off-by-one bug or something... Since the dmesg line mentions a function (dsl_fs_ss_count_adjust) that appears related to the snapshot count:
... it looks like it might be related to this property.
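As an aside, a hedged example of how those limit/count properties can be inspected on both sides (the dataset names below are placeholders, not the ones from this report):

```sh
# Show the limit and the currently tracked count for snapshots and child
# filesystems; -p prints raw numeric values so off-by-one differences are visible.
zfs get -p snapshot_limit,snapshot_count,filesystem_limit,filesystem_count srcpool/dataset
zfs get -p snapshot_limit,snapshot_count,filesystem_limit,filesystem_count dstpool/backups/dataset
```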
Where it PANICKED:
I bet it's something there... I can still provide dbgmsg output if you like, but I'm out of gas for the moment, as I've been trying a few things to work around this. I do know that lowering the snapshot limit values on the source and destination datasets before the send/recv did not make a difference.
Ahh, what the heck: Here's the only dbgmsg output from the recv command:
This also crashes when sending to an entirely new dataset on a different pool on a different server (5.4.0-40-generic / 0.8.3-1ubuntu12), even after reducing the snapshot_limit to 9999999 (but this time it sends all the data before crashing).
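(For reference, lowering the limit is just a property set; the dataset name below is a placeholder:)

```sh
# snapshot_limit caps how many snapshots can exist under a dataset;
# setting it to 'none' removes the cap entirely.
zfs set snapshot_limit=9999999 dstpool/backups/dataset
```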
I think that the panic occurs when destroying the temporary clone, because of this check in dsl_dir.c:

```c
/*
 * When we receive an incremental stream into a filesystem that already
 * exists, a temporary clone is created.  We don't count this temporary
 * clone, whose name begins with a '%'. We also ignore hidden ($FREE,
 * $MOS & $ORIGIN) objsets.
 */
if ((dd->dd_myname[0] == '%' || dd->dd_myname[0] == '$') &&
    strcmp(prop, DD_FIELD_FILESYSTEM_COUNT) == 0)
	return;
```
Hmm. So either the assertion fails when [...]. What's the best way for me to get more details on what's happening? I can run bpftrace if someone can point me to the probes to monitor.
You could use bpftrace to print out the stack [...]
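As an illustration only (the thread doesn't name a specific probe), one approach is to attach a kprobe to the function that appears in the backtrace and print the kernel stack every time it fires:

```sh
# Hypothetical probe: dump the kernel stack on each call to dsl_fs_ss_count_adjust(),
# which is the function mentioned in the dmesg output above.
sudo bpftrace -e 'kprobe:dsl_fs_ss_count_adjust { printf("%s\n", kstack); }'
```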
I ran into another problem (#10787) that's kept me from further troubleshooting this.
I was hoping this would have been resolved by some changes in the 2.0 release, but I am still encountering what appears to be the same PANIC when attempting to destroy a dataset.
Dataset details:
The PANIC does not output anything to [...]
The originally-reported issue should have been resolved by #10791, which is fixed in 2.0. It's possible that if your filesystem (data/myhost-main/data/backupdata/veeam/repos/repo_01_old) was created on an earlier release, you could still hit this when destroying it.
I also tried aborting the send:
From dmesg:
Yes, it was created in the 0.8.x release. Any idea how I can safely get rid of this dataset then?
If you can remove the VERIFY and compile from source, the destroy might "just work" and not introduce any more problems.
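If someone wants to try that, here is a rough sketch of the rebuild workflow; it follows the standard OpenZFS build steps rather than anything specific to this issue, and the exact VERIFY to comment out is the one named in the panic message:

```sh
# Build and install a patched OpenZFS after editing module/zfs/dsl_dir.c
# (the edit itself is not shown; comment out the failing VERIFY).
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-0.8.4              # match the release currently in use
# ... edit module/zfs/dsl_dir.c ...
sh autogen.sh
./configure
make -s -j"$(nproc)"
sudo make install
sudo ldconfig
sudo depmod -a                      # ensure the rebuilt zfs.ko is picked up
# Reboot (or export the pool and reload the modules) before retrying the destroy.
```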
I have yet to be able to try that, as the system encountering this problem is in production and it's critical that it stays up and running. But I have also now encountered this problem on another system that has hundreds of datasets that I am unable to destroy (or destroy snapshots from) due to the same problem. And since this server is continuously receiving snapshots from other systems, it is only a matter of time before this pool runs out of space. Yikes! I am now in a race against the clock to solve this, so I will likely be testing this shortly.

But let's say it does work... Shouldn't that change (or a more proper fix for the issue) be made to ZFS itself, not just the manually compiled version on my server, to avoid this problem showing up again in the future for anyone else? Also, I can't readily tell at the moment: is that change to dsl_dir.c part of the kernel module or the zfs userland CLI program? Thanks!
On a system running zfs 0.8.4 I have rebuilt zfs, commenting out the VERIFY statement in question. So far I have been able to destroy such problematic datasets (and snapshots on those datasets) without any visible sign of a problem. How should we move forward with this information to fix this problem permanently for everyone?
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
System information
Describe the problem you're observing
I am encountering a PANIC of ZFS upon a zfs recv that hangs the pool until the system is restarted.
I initially encountered this problem with Ubuntu 20.04's official version of ZFS (0.8.3) and then tried upgrading to 0.8.4 to see if the problem still exists. The PANIC occurs in both versions. I used the pre-built version from this PPA:
https://launchpad.net/~jonathonf/+archive/ubuntu/zfs
Describe how to reproduce the problem
The zfs send command NOOP output:
I can issue a zfs send to /dev/null that is successful:
The exact zfs send command used when the destination crashes:
The zfs recv command that crashes:
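(As a purely illustrative sketch — the actual commands, flags, and dataset names were included in the original report and are not reproduced here — the operation has the general shape of a full send piped into a receive on the backup server, with ssh as an assumed transport:)

```sh
# Placeholder snapshot/dataset names; the report describes a full (non-incremental)
# stream received on top of an already-existing destination filesystem.
zfs send -v srcpool/scratch@snap | ssh backuphost zfs recv -F dpool/backups/.../scratch

# The NOOP (dry-run) form, which only estimates the stream size:
zfs send -nv srcpool/scratch@snap

# Sending to /dev/null to confirm the stream can be generated on the source:
zfs send srcpool/scratch@snap > /dev/null
```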
Properties of the source dataset:
Include any warning/errors/backtraces from the system logs
Relevant dmesg:
I've done plenty of sends and receives between these 2 systems, but only this particular dataset is causing this problem, so reproducing it elsewhere may be difficult.
Is this certainly a problem with the destination side, or could it be a problem with the ZFS stream?
What other information can I provide that would help narrow the problem down?
Should the version of ZFS on the source machine matter even though it's the receive that is crashing?
The source machine is running kernel 5.4.0-40-generic with the official Ubuntu zfs/spl 0.8.3-1ubuntu12.2.