-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update resume token at object receive. #15927
Conversation
Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed-by: Allan Jude <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth putting a comment in send_cb
in the send code to say that we're actually relying on this behavior now; it's a somewhat natural result of the way dataset traversing works (When resuming from blkid 0 in an object, we haven't completed the whole subtree below the object, so the traverse code visits it, and we don't actually do any checks for where we're resuming from in the send code itself), but it's possible someone might want tweak that in the future, reintroducing this bug. That or we could add a zfs suite test to verify that this works, though I think it would necessarily need to be a little flaky unless we can manually make/tweak the resume token to get the effect we want.
Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15927
Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15927
Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15927
Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15927
Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15927
Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point.
While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it.
How Has This Been Tested?
I was able to recreate original problem, seeing no resume token updates during long stream receive without data. With the patch I see token updates, can resume replication using them and see that resumed stream starts from the expected object.
Types of changes
Checklist:
Signed-off-by
.