-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
transfer: preserve workspace data #6860
Conversation
1cf7ba2
to
ef34689
Compare
de41c3d
to
2c79d7d
Compare
ff5eb57
to
37d0ad4
Compare
2e80af9
to
b44451a
Compare
@@ -121,6 +121,16 @@ def _cache_is_copy(cache, path_info): | |||
return cache.cache_types[0] == "copy" | |||
|
|||
|
|||
def _relink(cache, cache_info, fs, path_info, in_cache, force): | |||
_remove(path_info, fs, in_cache, force=force) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing we could do here is move original file aside (e.g. .data.back
) so that even if _link
fails - we could bring it back. Clearly there is a storage cost for it though. Though we don't relink copies anyway, so it should actually be fine. Though this should be done carefully. Will need to look into it after this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing we could do here is move original file aside (e.g. .data.back) so that even if _link fails - we could bring it back. Clearly there is a storage cost for it though. Though we don't relink copies anyway, so it should actually be fine. Though this should be done carefully. Will need to look into it after this PR.
But here we only create a link and that would not cost too much storage. Even in the worst condition the overall storage cost would only be a little more than before the operation.
Btw,
def _do_link(cache, from_info, to_info, link_method):
if cache.fs.exists(to_info):
cache.fs.remove(to_info) # broken symlink
link_method(from_info, to_info)
logger.debug(
"Created '%s': %s -> %s", cache.cache_types[0], from_info, to_info
)
Looks similar to it.
e00b3a7
to
464400b
Compare
Currently it depends on the whole `info()` output, which has too much information that doesn't really matter. To check if file contents didn't change mtime, size, inode combo is enough.
`os.link` is broken across platforms when dealing with symlinks. See https://bugs.python.org/issue41355 for more info.
Depending on a filesyste (e.g. on NTFS), `_remove` might reset read-only permissions in order to delete a hardlink to protected object, which will also reset it for the object itself, making it unprotected, so we need to protect it back.
Currently whether or not link confirmation will run depends on particular filesystem and existing workspace link type, which introduces inconsistency in behaviour, where we sometimes catch unsupported links before touching user data and sometimes we don't.
41ac816
to
ffa7650
Compare
@@ -14,27 +14,48 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
|
|||
def _link( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will unify this with the stuff we do in checkout
in the followups, since it is out of scope.
Currently, we use
move
to transfer data from the workspace to the local odb. This means that if for some reason we've failed to transfer one file from the dataset, all of the completed files would disappear from the workspace and, because we save.dir
last, there will be not enough metadata around to recover the original dataset structure.This PR, makes us try to use
reflink
or ahardlink
instead ofmove
to keep the workspace structure intact, while also keeping the odb complete. So if the same type of failure occurs, the worst that could happen is that some of the files in the dataset will be inread-only
mode, which is easily recoverable. This also allows us to introduce recovery functionality in the future, that could both rollback thedvc add
or just continue it from where it left.The main reason why we've used
move
was performance, as it allowed us to just create the links from scratch when doing thecheckout
part. Now, with the introduction ofdvc.objects.diff
, we can be smart about it and only relink if it makes sense. The details still will be optimized in the near future though.Transferring in terms of
copy
operations before and after:So if we are on some sketchy fs where even
hardlink
doesn't work, butrename
does - we'll indeed have tocopy
, but user's workspace stays intact and this is more important fortransfer
. Butdvc add
consists oftransferring
andcheckout
, so if user is using defaultcache.type
(reflink or copy) - we'll have tocopy
from cache back to workspace aftermove
anyway, so overall we will do the same 1 copy operation. On the other hand, after this PR we have a new situation when we might've used ahardlink
duringtransfer
ing but, the user wants asymlink
so we'll need to relink, which is pretty cheap andcheckout
is already doing those checks fordvc checkout --relink
anyway.Overall, in order to keep the user workspace safer, this PR sacrifices
hardlink
ing overhead in some cases, while also makingtransfer
ing operation much more correct by not being destructive.Also, please note, that same as
object.diff
, this will benefit from path_info removal and c/cython extensions in the near future.Partial mitigation for #6387