File incorrectly zeroed when receiving incremental stream that toggles -L #6224
Comments
I think I found the unifying characteristic of all corrupted files: All of them are in directories that have been updated via file creation or deletion between snapshots. Which of the files in the directory get corrupted seems to be random, though.
I'll try to put something together. I'm busy for the next week though, so it'll take a while.
I can reproduce the issue consistently with this script, tested on two different machines (0.6.5.6 and 0.6.5.9):
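The original script is not included in this thread; below is a hedged approximation of such a reproducer, based on the descriptions in this issue (1M recordsize, an initial send with -L, files created and deleted between snapshots, then an incremental send without -L). Pool names, vdev files, and data sizes are illustrative assumptions.

```sh
#!/bin/sh -e
# Approximate reproducer sketch (not the original script).
truncate -s 1G /tmp/src.img /tmp/dst.img
zpool create src /tmp/src.img
zpool create dst /tmp/dst.img
zfs create -o recordsize=1M src/data

dd if=/dev/urandom of=/src/data/file1 bs=1M count=4
dd if=/dev/urandom of=/src/data/file2 bs=1M count=4
zfs snapshot src/data@snap1
zfs send -L src/data@snap1 | zfs receive dst/data         # initial send WITH -L

touch /src/data/file3                                     # change the directory between snapshots
rm /src/data/file2
zfs snapshot src/data@snap2
zfs send -i @snap1 src/data@snap2 | zfs receive dst/data  # incremental WITHOUT -L

du -h /dst/data/                                          # check size on disk with du or ncdu
```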
After running, check the size on disk of the files in the target dataset using du or ncdu. Note that the first file occupies only 512 bytes on disk and its content reads back as all zeros.
@zrav thank you for providing a working reproducer! I've not had the time to debug this properly yet, but after a cursory reading of the code I think we're seeing the result of the following block:

```c
/*
 * If we are losing blkptrs or changing the block size this must
 * be a new file instance. We must clear out the previous file
 * contents before we can change this type of metadata in the dnode.
 */
if (err == 0) {
	int nblkptr;

	nblkptr = deduce_nblkptr(drro->drr_bonustype,
	    drro->drr_bonuslen);

	if (drro->drr_blksz != doi.doi_data_block_size ||
	    nblkptr < doi.doi_nblkptr) {
		err = dmu_free_long_range(rwa->os, drro->drr_object,
		    0, DMU_OBJECT_END);
		if (err != 0)
			return (SET_ERROR(EINVAL));
	}
}
```
When we send the incremental stream without -L, the block size in the stream (128K) no longer matches the object's on-disk block size (1M), so dmu_free_long_range() wipes the existing file contents before the incremental data is applied.

First send (blksz = 1M):

Incremental send (blksz = 128K):

The resulting file on disk is:
The code I was mentioning earlier was integrated with 6c59307 (Illumos 3693 - restore_object uses at least two transactions to restore an object), even before f1512ee (Illumos 5027 - zfs large block support): if all this is true I think every ZFS version since Illumos 5027 is affected, possibly on other platforms too.
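As a hedged illustration of the block-size change described above (the zdb output itself is not reproduced in this thread, and the dataset and file names are assumptions), the receiving side's dnode can be inspected before and after the no-L incremental:

```sh
# Dump the affected file's dnode on the receive side before and after the no-L
# incremental and compare the "dblk" (block size) and "dsize" (allocated size)
# columns. Dataset and file names are illustrative assumptions.
obj=$(stat -c %i /dst/data/file1)   # for ZPL files the object number equals the inode number
zdb -dddd dst/data "$obj"
```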
This reproduces on current master and Illumos, which is quite troubling. Cross-platform reproducer here https://gist.github.com/loli10K/af70fb0f1aae7b174822aa5657e07c28. ZoL:
Illumos (SmartOS):
FWIW, although it says above that there is a cross-platform reproducer that works on Illumos, @lundman notes that it hasn't made it to the slack and wonders if @ahrens has seen this. (It reproduces in openzfsonosx too, and annoyingly I have a few affected files in backups that thankfully I haven't had to restore from.)
@behlendorf any particular reason this wasn't revisited?
Available developer resources; I'll add it to the 0.8 milestone so we don't lose track of it.
Just for reference, a similar issue was reported:
@loli10K that issue is related but different. I would guess the problem here is with turning `-L` (large blocks) on or off between incremental sends. The solution might be a bit tricky, but I think it will have to involve disallowing some kinds of incremental receives.
@pcd1193182 or @tcaputi also have some experience with zfs send/recv.
We are having the same kind of issue in openzfs/openzfs#705.... I don't have a solution right at this moment (other than simply forcing zfs or the user to use `-L` consistently).
I just tried this on an encrypted dataset with raw sending. Since I use legacy mountpoints, I altered the script to:
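The altered script is not shown in the thread; this is a hedged sketch of the relevant part, assuming an encrypted source dataset and legacy mountpoints (dataset and mountpoint names are assumptions):

```sh
# Raw (-w) sends of an encrypted dataset, received unmounted and then mounted
# via a legacy mountpoint to check the resulting file sizes.
zfs send -w src/enc@snap1 | zfs receive -u dst/enc
zfs send -w -i @snap1 src/enc@snap2 | zfs receive -u dst/enc
zfs load-key dst/enc
mount -t zfs dst/enc /mnt/enc
du -h /mnt/enc            # sizes on disk look correct; the bug is not triggered
```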
and it produced:
Also, after mounting it seems to be fine.
It seems that, when using a raw send on an encrypted dataset, this bug is not triggered.
Right, because raw sends always send the blocks exactly as they are stored on disk.
Raw sends imply `-L` (--large-block).
OK, thanks for the explanation.
When receiving an object in a send stream the receive_object() function must determine if it is an existing or new object. This is normally straightforward since that object number will usually not be allocated, and therefore it must be a new object. However, when the object exists there are two possible scenarios. 1) The object may have been freed and an entirely new object allocated, in which case it needs to be reallocated to free any attached spill block and to set the new attributes (i.e. block size, bonus size, etc). Or, 2) The object's attributes, like block size, were modified at the source but it is the same original object, in which case only those attributes should be updated, and everything else preserved.

The issue is that this determination is accomplished using a set of heuristics from the OBJECT record. Unfortunately, these fields aren't sufficient to always distinguish between these two cases. The result is that a change in the object's block size will result in it being reallocated. As part of this reallocation any spill block associated with the object will be freed.

When performing a normal send/recv this issue will most likely manifest itself as a file with missing xattrs. This is because when the xattr=sa property is set the xattrs can be stored in this lost spill block. If this issue occurs when performing a raw send then the missing spill block will trigger an authentication error. This error will prevent the receiving side from accessing the damaged dnode block. Furthermore, if the first dnode block is damaged in this way it will make it impossible to mount the received snapshot.

This change resolves the issue by updating the sender to always include a SPILL record for each OBJECT record with a spill block. This allows the missing spill block to be recreated if it's freed during receive_object().

The major advantage of this approach is that it is backwards compatible with existing versions of 'zfs receive'. This means there's no need to add an incompatible feature flag which is only understood by the latest versions. Older versions of the software which already know how to handle spill blocks will do the right thing.

The downside to this approach is that it can increase the size of the stream due to the additional spill blocks. Additionally, since new spill blocks will be written the received snapshot will consume more capacity. These drawbacks can be largely mitigated by using the large dnode feature which reduces the need for spill blocks.

Both the send_realloc_files and send_realloc_encrypted_files ZTS test cases were updated to create xattrs in order to force spill blocks. As part of validating an incremental receive the contents of all received xattrs are verified against the source snapshot.

OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#6224
@scineram I think that would work, as a partial solution. If the dataset does not have large blocks, but the stream does, we don't currently have a way of knowing whether that's because the earlier send split the large blocks (in which case we should reject this receive), or the earlier snapshot didn't happen to have any large blocks in it (in which case we should accept it). See my earlier comment #6224 (comment) for more thoughts on this.
Unfortunately, it looks like
Yes. One difficulty of dealing with this issue is that the current default is not the recommended way to do sends of filesystems with large blocks, but switching to the recommended way can hit this bug.
It seems like there is a lot of duplication of ideas and discussion going on here and elsewhere, so I'm going to attempt to consolidate and summarize all the various suggestions that have been made, and highlight the problems with each of them. This will hopefully help people to understand why this issue hasn't been resolved yet; it's not just a matter of implementing "[a solution] that is free of data loss": none of the proposed solutions avoids data loss in all cases for all users.
Personally, I would advocate that we immediately move forward with #4; it is the only perfect long-term solution to this problem and others like it that we may encounter later on. The sooner we get the feature into zfs, the more data will have the creation_txg in it.

For all existing data, I think that the best behavior is probably a combination of 1b and 2: protecting existing users with large-block datasets who have used -L and users who have small-block datasets, and requiring large-block users who have not used -L thus far to restart their sends, or to deliberately use a zstream split-blocks feature. Hopefully having to use a separate command line utility is sufficiently abnormal for the zfs send workflow that it will give users pause, and prevent them from encountering this bug, especially if adequate explanation is placed in the man pages/usage messages for various commands. Unfortunately, my guess is that there are a lot of users who fall into the large-block-without-dash-L camp, and this solution will inconvenience all of them. I don't know how to fix that.

If there are solutions that people proposed or like that I did not address, feel free to let me know and I will attempt to keep this up to date. If you see a way around one of the issues, or a new solution, again let me know and I will try to add/address it. Hopefully this helps people understand the current state of this bug and our attempts to come up with an ideal solution.
Thanks for the great write-up, @pcd1193182! I realized there's a simpler way to do this: the ZPL already has a per-file generation number (unique for each actual file, even if they have the same object number), which it stores in the bonus buffer, which is included in the OBJECT record. We can use the generation number to tell whether a file's block size changed due to toggling `-L`, or whether the object was freed and reallocated for a new file. I think we can also orthogonally consider 1b (make -L the default), once we have the safety of the above solution, which I'm working on implementing.
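For reference, the per-file generation number mentioned above is visible from userland; a hedged example follows (dataset name and file path are assumptions):

```sh
# zdb prints the ZPL attributes stored in the bonus buffer, including "gen",
# for a given object; two files that reuse the same object number will have
# different generation numbers.
obj=$(stat -c %i /tank/data/somefile)
zdb -dddd tank/data "$obj" | grep -w gen
```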
@ahrens Do we want to reject the receive in this case? Or do we just want to correctly reallocate the dnode (as the heuristic should have done)?
After discussion with @tcaputi, I think I have a way to correctly process send streams that change to have large blocks (the "no-L to -L" case).
@pcd1193182 Thank you for taking the time to compile the options. As far as I understand the on-disk format and code, the root of the issue is the details of storing objects in ObjSets:

which leads to objects being removed from the set leaving gaps in the dnode array, which are then (as I presume) greedily reused by ZFS to avoid growing that array unless absolutely needed, resulting in "object number" being ambiguous in a zfs send stream, resulting in the decision to have zfs recv detect recycling of an object number through looking for a changed record size. Did I understand that correctly so far?

If yes: do we really need that check for reuse of object numbers? Shouldn't the dead list when sending the snapshot take care of removing all stale data (that belonged to the removed object) from the destination?

Apart from that... I still think that ZFS send/recv should have the ability to convert datasets from one record-/volblocksize to another. Machines get retired, pools get transferred to new and different hardware as technology advances. If ZFS is meant to be the last word in filesystems... shouldn't it have the ability to remap the block size of the data in existing datasets (without sacrificing the snapshot history) to benefit from the abilities of the destination system's hardware as much as possible? Wouldn't it make sense to be able to instruct a backup system using shingled drives to store data in the 8MB (just to pull a number from thin air) records that align with the native write stripe size of the drives, regardless of the block size of the source? And in case of a transfer of such a backup to another system where a database daemon is to access and modify the dataset in the DB-native block size of (say) 16KiB... being able to recv that dataset with such a recordsize?
That's right.
We need it so that the new object is set up correctly before we start getting WRITE records, e.g. so it has the correct block size and dnode size (number of "slots").
I hear that you would like us to implement that new feature. It's a reasonable thing to want, but not straightforward to implement. Right now I am prioritizing this data-loss bug.
I agree with your priorities. My question is: will the ZPL's generation number be deterministic enough to completely solve this issue, or will it just trade this issue for a more convoluted and obscure variation?
…s -L

Background:

By increasing the recordsize property above the default of 128KB, a filesystem may have "large" blocks. By default, a send stream of such a filesystem does not contain large WRITE records, instead it decreases objects' block sizes to 128KB and splits the large blocks into 128KB blocks, allowing the large-block filesystem to be received by a system that does not support the `large_blocks` feature. A send stream generated by `zfs send -L` (or `--large-block`) preserves the large block size on the receiving system, by using large WRITE records.

When receiving an incremental send stream for a filesystem with large blocks, if the send stream's -L flag was toggled, a bug is encountered in which the file's contents are incorrectly zeroed out. The contents of any blocks that were not modified by this send stream will be lost. "Toggled" means that the previous send used `-L`, but this incremental does not use `-L` (-L to no-L); or that the previous send did not use `-L`, but this incremental does use `-L` (no-L to -L).

Changes:

This commit addresses the problem with several changes to the semantics of zfs send/receive:

1. "-L to no-L" incrementals are rejected. If the previous send used `-L`, but this incremental does not use `-L`, the `zfs receive` will fail with this error message: incremental send stream requires -L (--large-block), to match previous receive.

2. "no-L to -L" incrementals are handled correctly, preserving the smaller (128KB) block size of any already-received files that used large blocks on the sending system but were split by `zfs send` without the `-L` flag.

3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`. This feature indicates that we can correctly handle "no-L to -L" incrementals. This flag is currently not set on any send streams. In the future, we intend for incremental send streams of snapshots that have large blocks to use `-L` by default, and these streams will also have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams from the default use of `zfs send` won't encounter the bug mentioned above, because they can't be received by software with the bug.

Implementation notes:

To facilitate accessing the ZPL's generation number, `zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and restructured to fill in a struct with ZPL-specific info including owner and generation.

In the "no-L to -L" case, if this is a compressed send stream (from `zfs send -cL`), large WRITE records that are being written to small (128KB) blocksize files need to be decompressed so that they can be written split up into multiple blocks. The zio pipeline will recompress each smaller block individually.

A new test case, `send-L_toggle`, is added, which tests the "no-L to -L" case and verifies that we get an error for the "-L to no-L" case.

Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#6224
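A hedged sketch of how the change above behaves for a "-L to no-L" incremental (dataset and snapshot names are assumptions; the error text is the one quoted in the commit message):

```sh
# Previous receive used -L; the incremental without -L is now rejected instead
# of silently zeroing file contents.
zfs send -L tank/data@a | zfs receive backup/data
zfs send -i @a tank/data@b | zfs receive backup/data
# fails with: incremental send stream requires -L (--large-block), to match previous receive.
```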
Sorry to hijack this, but I think I've been bitten by this bug on a replication between two zfs pools (running FreeNAS). Is there any recovery path once corruption is established, or is the only option to destroy the replicated pool and start from scratch? Also, pardon the necessarily stupid question, but how come a zeroed file doesn't trigger any checksumming error upon reading? I was under the impression that zfs send/recv carries all the checksumming info necessary to confirm that what's written on the recv end exactly matches what was sent; am I mistaken? Thanks
@f00b4r0 I'm not aware of any recovery path. The send stream is itself checksummed, but it doesn't include checksums of every block of every file.
sigh, thanks. Not what I was hoping for, but not entirely unexpected.
This sounds like a design flaw: per-block checksums would have immediately exposed this bug. Or am I missing something?
@f00b4r0 If the receiving filesystem has snapshots from before the corruption, the data should still be in the snapshots. The trick is locating all the null-filled files and the clean snapshot. I recovered millions of files from my snapshot history with some scripting work. It is best if you have a clean copy that wasn't affected by this bug and redo the replicated copy.
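A hedged sketch of the kind of scripting mentioned above, for locating candidate zeroed files (the path and size thresholds are assumptions, and legitimately sparse files will also match, so review the results before restoring from snapshots):

```sh
# List files whose apparent size is large but which occupy almost no blocks on
# disk -- the signature of the zeroed files described in this issue.
find /backup/tank -type f -size +128k -printf '%b %s %p\n' |
    awk '$1 * 512 <= 4096 { print $3 }'
```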
@lschweiss-wustl thanks. It looks like I'll have to start from scratch. Then again that bug bites so hard, I'm not sure I can ever trust send/recv again. At least rsync doesn't silently wipe data ;P
Sorry if I'm asking something obvious, but I have a question: is sending a dataset from a pool that only uses the default 128K recordsize affected by this bug? In other words: if I never used large blocks or `-L`, am I safe? Thanks.
That's right. The bug is about changing the `-L` flag between incremental sends of large-block datasets.
In this scenario you don't have any large blocks, so it doesn't matter. You can even toggle `-L` between sends without hitting the bug.
Correct.
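Two quick checks that follow from the answers above (pool name is an assumption): if recordsize was never raised above 128K and the large_blocks feature was never activated, the datasets have no large blocks and toggling -L cannot hurt.

```sh
zfs get -r -o name,value,source recordsize tank
zpool get feature@large_blocks tank    # "enabled" = never used, "active" = large blocks present
```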
System information
Describe the problem you're observing
I have found a data corruption issue in zfs send. In pools using 1M recordsize, incremental sends without the -L flag sometimes silently zero out some files. The results are repeatable. Scrub does not find any errors.
Tested on Ubuntu Xenial with 0.6.5.6 and Zesty with 0.6.5.9 on the same systems.
Describe how to reproduce the problem
Source pool:
The target pool:
While the issue was observed with multiple datasets, I'll focus on a smaller one. First, the source (some uneventful snapshots omitted):
The initial send to ubackup has been performed with the -L and -e flags (not sure if this is relevant). Then performing an incremental send with the -L flag produces the expected result (size differences are due to different pool geometry):
But repeating the same send without the -L flag results in corruption:
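The command output is not included in this report; below is a hedged reconstruction of the sends being described (dataset and snapshot names are assumptions, and the rollback step is an assumption made so that the same incremental can be repeated):

```sh
zfs send -L -e tank/data@a | zfs receive ubackup/data          # initial send with -L and -e
zfs send -L -e -i @a tank/data@b | zfs receive ubackup/data    # incremental with -L: as expected
zfs rollback -r ubackup/data@a                                 # assumed, to repeat the same incremental
zfs send -e -i @a tank/data@b | zfs receive ubackup/data       # same incremental without -L: corruption
```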
Notice how the REFER size drops and the USEDSNAP increases between the two runs. I looked around to find where the data had gone missing and found several random files whose reported sizes on disk were reduced to 512 bytes. Reading these files seems to return the full original size but with content all zeros.
I repeated the whole process multiple times, also recreating the target pool, with the same result. The affected files are always the same, but I haven't found a common characteristic among them. Scrubbing both pools finds no errors. Syslog shows nothing uncommon. I have never noticed something similar before. The only notable change I made recently was that I started using SA-based xattrs in May, but that might be unrelated.
I am happy to provide more info as far as I'm able.