Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transfer: preserve workspace data #6860

Merged
merged 5 commits into from
Oct 27, 2021
Merged

transfer: preserve workspace data #6860

merged 5 commits into from
Oct 27, 2021

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Oct 25, 2021

Currently, we use move to transfer data from the workspace to the local odb. This means that if for some reason we've failed to transfer one file from the dataset, all of the completed files would disappear from the workspace and, because we save .dir last, there will be not enough metadata around to recover the original dataset structure.

This PR, makes us try to use reflink or a hardlink instead of move to keep the workspace structure intact, while also keeping the odb complete. So if the same type of failure occurs, the worst that could happen is that some of the files in the dataset will be in read-only mode, which is easily recoverable. This also allows us to introduce recovery functionality in the future, that could both rollback the dvc add or just continue it from where it left.

The main reason why we've used move was performance, as it allowed us to just create the links from scratch when doing the checkout part. Now, with the introduction of dvc.objects.diff, we can be smart about it and only relink if it makes sense. The details still will be optimized in the near future though.

Transferring in terms of copy operations before and after:

move {ref,hard}link
same fs same as rename, no copy no copy
same fs, no link no copy copy
different fs copy copy

So if we are on some sketchy fs where even hardlink doesn't work, but rename does - we'll indeed have to copy, but user's workspace stays intact and this is more important for transfer. But dvc add consists of transferring and checkout, so if user is using default cache.type(reflink or copy) - we'll have to copy from cache back to workspace after move anyway, so overall we will do the same 1 copy operation. On the other hand, after this PR we have a new situation when we might've used a hardlink during transfering but, the user wants a symlink so we'll need to relink, which is pretty cheap and checkout is already doing those checks for dvc checkout --relink anyway.

Overall, in order to keep the user workspace safer, this PR sacrifices hardlinking overhead in some cases, while also making transfering operation much more correct by not being destructive.

Also, please note, that same as object.diff, this will benefit from path_info removal and c/cython extensions in the near future.

Partial mitigation for #6387

@efiop efiop requested a review from a team as a code owner October 25, 2021 00:02
@efiop efiop requested a review from skshetry October 25, 2021 00:02
@efiop efiop added the enhancement Enhances DVC label Oct 25, 2021
@efiop efiop force-pushed the fix-6387 branch 3 times, most recently from 1cf7ba2 to ef34689 Compare October 25, 2021 00:40
@efiop efiop mentioned this pull request Oct 25, 2021
@efiop efiop force-pushed the fix-6387 branch 10 times, most recently from de41c3d to 2c79d7d Compare October 26, 2021 12:39
@efiop efiop self-assigned this Oct 26, 2021
@efiop efiop added p0-critical Critical issue. Needs to be fixed ASAP. bugfix fixes bug labels Oct 26, 2021
@efiop efiop force-pushed the fix-6387 branch 10 times, most recently from ff5eb57 to 37d0ad4 Compare October 26, 2021 21:01
@efiop efiop force-pushed the fix-6387 branch 7 times, most recently from 2e80af9 to b44451a Compare October 26, 2021 22:56
@@ -121,6 +121,16 @@ def _cache_is_copy(cache, path_info):
return cache.cache_types[0] == "copy"


def _relink(cache, cache_info, fs, path_info, in_cache, force):
_remove(path_info, fs, in_cache, force=force)
Copy link
Contributor Author

@efiop efiop Oct 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing we could do here is move original file aside (e.g. .data.back) so that even if _link fails - we could bring it back. Clearly there is a storage cost for it though. Though we don't relink copies anyway, so it should actually be fine. Though this should be done carefully. Will need to look into it after this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing we could do here is move original file aside (e.g. .data.back) so that even if _link fails - we could bring it back. Clearly there is a storage cost for it though. Though we don't relink copies anyway, so it should actually be fine. Though this should be done carefully. Will need to look into it after this PR.

But here we only create a link and that would not cost too much storage. Even in the worst condition the overall storage cost would only be a little more than before the operation.

Btw,

def _do_link(cache, from_info, to_info, link_method):
    if cache.fs.exists(to_info):
        cache.fs.remove(to_info)  # broken symlink
    link_method(from_info, to_info)
    logger.debug(
        "Created '%s': %s -> %s", cache.cache_types[0], from_info, to_info
    )

Looks similar to it.

@efiop efiop force-pushed the fix-6387 branch 2 times, most recently from e00b3a7 to 464400b Compare October 27, 2021 01:06
efiop and others added 4 commits October 27, 2021 04:09
Currently it depends on the whole `info()` output, which has too much
information that doesn't really matter. To check if file contents didn't
change mtime, size, inode combo is enough.
`os.link` is broken across platforms when dealing with symlinks.

See https://bugs.python.org/issue41355 for more info.
Depending on a filesyste (e.g. on NTFS), `_remove` might reset
read-only permissions in order to delete a hardlink to protected object,
which will also reset it for the object itself, making it unprotected,
so we need to protect it back.
Currently whether or not link confirmation will run depends on
particular filesystem and existing workspace link type, which introduces
inconsistency in behaviour, where we sometimes catch unsupported links
before touching user data and sometimes we don't.
dvc/fs/utils.py Outdated Show resolved Hide resolved
@efiop efiop force-pushed the fix-6387 branch 2 times, most recently from 41ac816 to ffa7650 Compare October 27, 2021 15:49
@efiop efiop changed the title [WIP] transfer: preserve workspace data transfer: preserve workspace data Oct 27, 2021
@@ -14,27 +14,48 @@
logger = logging.getLogger(__name__)


def _link(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will unify this with the stuff we do in checkout in the followups, since it is out of scope.

@efiop efiop changed the title transfer: preserve workspace data [WIP] transfer: preserve workspace data Oct 27, 2021
@efiop efiop changed the title [WIP] transfer: preserve workspace data transfer: preserve workspace data Oct 27, 2021
@efiop efiop merged commit 922eb20 into iterative:master Oct 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix fixes bug enhancement Enhances DVC p0-critical Critical issue. Needs to be fixed ASAP.
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants