Kernel panic on incremental send/recv between two encrypted datasets on the same pool, dest is using zstd-19 #12785
Comments
I allowed this scrub to complete normally, then sent the remaining snapshots to the destination dataset. The "permanent errors" were still listed against the pool. Then I performed a second scrub, and after that, there are no permanent errors listed. I have not rebooted or exported/imported, etc., in the meantime.
This "permanent errors", "data corruption", and "Otherwise restore the entire pool from backup." messaging really needs to change. It's rather frightening, and the solution appears to be to scrub it not once, but twice. |
Well, it happened again, under similar conditions:
The output from pv indicates that 21.3 GB of the stream has been processed so far. I noticed some extra things in the system log, which could possibly be involved in triggering this issue:
Note that sanoid is never touching the src/dest datasets, but it is interacting with the pool that they are on. Also, syncoid is sending snapshots onto this pool every few minutes. I'm not sure if a reboot was required; I was able to mount the source dataset and look at its properties. But the dest is hanging around mid-receive, and the receive process couldn't be killed. I tried to unmount the source dataset, and that command froze, so I decided to reboot. Prior to actually rebooting, I took a
The dataset mentioned on the last line above is receiving via syncoid, on minutes ending in 1 or 6, via the following invocation from cron. (It's currently the only instance of syncoid being run anywhere.)
It's likely not a big delta at all, and there's every chance that this is what some of the CRON log messages are about.
Upon reboot, the pool status shows the same as above. A scrub has not begun, and the error shows the same. I then tried
I checked again, and the token was gone. I don't know whether syncoid did this, but as the key for rpool was not loaded, syncoid was unable to send new snapshots. The end of
I loaded the keys and manually invoked syncoid. It ran fine; the snapshots have appeared on
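For reference, a hedged sketch of how the resume token and keys can be inspected in this situation (dataset names are placeholders, not the actual ones from this report):

```sh
# Is there a partially received state on the destination?
zfs get -H -o value receive_resume_token rpool/dest

# Keys must be loaded before syncoid (or a manual send) can run
zfs get keystatus rpool
zfs load-key -r rpool

# Either resume the interrupted receive with the saved token...
zfs send -t "$(zfs get -H -o value receive_resume_token rpool/dest)" | zfs recv -s rpool/dest

# ...or abandon the partial receive state entirely
zfs recv -A rpool/dest
```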
Once again, after the second scrub, the error is no longer shown. I haven't rebooted, and I have been sending other datasets to this pool in the meantime.
FWIW, I've diffed at least a TB of data between snapshots mounted side by side, lz4 and zstd-19. So far no differences found, no data loss that I'm aware of.
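As an aside, this kind of side-by-side comparison can be done directly against the hidden snapshot directories; a sketch, with placeholder dataset and snapshot names:

```sh
# Snapshots are browseable read-only under .zfs/snapshot without cloning
diff -qr /tank/source/.zfs/snapshot/before \
         /tank/dest/.zfs/snapshot/before
# No output means no differing or missing files between the two copies
```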
Back to the issue. No one will answer the question @darrenfreeman is asking, because this issue needs investigation, and from my understanding there is not much interest in solving send/receive issues involving encryption. So most probably this issue won't ever be resolved, or you could try to resolve it alone :) since, as far as I understand, there is no support.
I am investigating without looking at any code. My current test is to recompress 3 TB, after dropping the L2ARC, SLOG, and stopping sanoid/syncoid. Many would find that a reasonable compromise if it succeeds, since those can be added back afterwards.
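A rough outline of that test setup, assuming the cache and log devices are single (non-mirrored) vdevs and that sanoid runs from a systemd timer (device names and unit names here are placeholders):

```sh
# Remove the L2ARC and SLOG devices for the duration of the test
zpool remove tank nvme-cache-part1   # cache (L2ARC) device
zpool remove tank nvme-slog-part2    # log (SLOG) device

# Stop the snapshot/replication jobs so nothing else touches the pool
systemctl stop sanoid.timer
# (comment out the syncoid cron entry as well, if that's how it is invoked)

zpool status tank   # confirm the cache/log vdevs are gone before retesting
```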
Maybe someone can shed some light on what this assert means, because it is returning EIO:
Could it be the source of these troubles?
The symptoms seem identical to #12001 and #12270, which don't use zstd. So I don't see any evidence that this is specific to ZSTD. I'm going to mark zstd-specific comments as offtopic. Polite suggestions for how to improve ZSTD are welcome in other (perhaps new) issues. Similarly, I don't think that memory usage or overall speed/performance is at issue here; marked those as offtopic as well. It seems probable that the EIO (5) here is not due to hardware failure, but rather something going wrong with the zfs receive, which needs to be investigated.
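One way to back up the "not hardware" reading, sketched with a placeholder pool name: confirm the vdevs report no device-level errors while the receive is failing.

```sh
zpool status -v rpool                             # READ/WRITE/CKSUM columns should all be zero
dmesg | grep -iE 'i/o error|blk_update_request'   # no block-layer errors expected
```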
The second hang, above, showed that a snapshot was being received via syncoid. And this snapshot was the one with the "permanent" error in its partially received state. I have just transferred a large single snapshot, without a hang. Syncoid was not running, and the send/recv was not incremental. Now going to reinstate sanoid and syncoid.
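For clarity, the difference between the transfer that worked and the pattern that hung, sketched with placeholder dataset and snapshot names:

```sh
# Full (non-incremental) send of one snapshot -- this completed fine
zfs send tank/src@snap2 | zfs recv -o compression=zstd-19 tank/dest

# Incremental send between two snapshots -- the pattern that hung
zfs send -i tank/src@snap1 tank/src@snap2 | zfs recv tank/dest
```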
There are a few places that can return EIO in this path. Can you try reproducing with SET_ERROR tracing enabled? It might also be helpful to see the resulting debug output. Let's also get all the properties involved.
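A sketch of how that information could be gathered (the zfs_flags bit value and dataset names here are assumptions; check zfs(4) on the affected system):

```sh
# Enable the in-kernel debug log and SET_ERROR tracing
echo 1   > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 512 > /sys/module/zfs/parameters/zfs_flags    # ZFS_DEBUG_SET_ERROR bit, per zfs(4)

# Reproduce the failing receive, then capture the trace
cat /proc/spl/kstat/zfs/dbgmsg > /tmp/zfs-dbgmsg.txt

# Collect every property on both sides of the transfer
zfs get all tank/src  > /tmp/src-props.txt
zfs get all tank/dest > /tmp/dest-props.txt
```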
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
The zfs send | zfs recv is frozen, and an attempt to zfs set mountpoint= on the source dataset also hung. So the system had to be rebooted. Upon reboot, a scrub appears to have started by itself.
Describe how to reproduce the problem
I don't think this will reproduce, but it's what I was doing at the time:
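(The exact commands aren't reproduced here; the pipeline was roughly of this shape, with placeholder dataset and snapshot names, piped through pv as mentioned in the comments above:)

```sh
# Incremental send between encrypted datasets on the same pool, with the
# destination recompressing to zstd-19
zfs send -I tank/src@old tank/src@new | pv | zfs recv tank/dest
```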
Source and destination datasets are encrypted using the same root. The only difference should be that compression=lz4 on the source, but compression=zstd-19 on the dest. CPU usage was around 75% on this system, which is otherwise at 1%; I attribute this to zstd compression. This is not the first dataset on this pool that I have successfully recompressed using the above technique. I have also done many TB of raw send/recv of encrypted datasets between pools, without rebooting, before this step which failed.
Include any warning/errors/backtraces from the system logs
Subsequently, sanoid attempted to snapshot an unrelated dataset. I'm not sure whether it completed, but I suspect not.
And then the inevitable time-out:
(Omitted three more similar timeouts.)
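If this recurs, the blocked-task backtraces behind those timeouts can be captured before rebooting; a sketch, assuming sysrq is enabled and the receive process is still present:

```sh
# Dump stacks of all uninterruptible (blocked) tasks into the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -200

# Or grab the kernel stack of the stuck receive directly
cat /proc/"$(pgrep -f 'zfs recv' | head -1)"/stack
```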