Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get: allow downloading regular files/dirs tracked by Git #2515

Closed
jorgeorpinel opened this issue Sep 19, 2019 · 14 comments · Fixed by #2837
Closed

get: allow downloading regular files/dirs tracked by Git #2515

jorgeorpinel opened this issue Sep 19, 2019 · 14 comments · Fixed by #2837
Labels
enhancement Enhances DVC p1-important Important, aka current backlog of things to do

Comments

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 19, 2019

As we decide how to list data artifacts in external DVC repos in #2509 it became apparent listing regular files along with stage outputs could be especially useful. See #2509 (comment).

This made us think also then users could want to get some of those individual regular files after seeing them. (See #2509 (comment))

@jorgeorpinel jorgeorpinel changed the title get: allow downloading regular files tracked by Git get: allow downloading regular files tracked by Git? Sep 19, 2019
@jorgeorpinel jorgeorpinel added the question I have a question? label Sep 19, 2019
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Sep 19, 2019

Cc @shcheklein

Why dvc get and not dvc import though? What about directory/recursive support for this feature?

Thanks

@efiop
Copy link
Contributor

efiop commented Sep 20, 2019

Related to #2507

@ghost ghost added the enhancement Enhances DVC label Sep 30, 2019
@ghost
Copy link

ghost commented Sep 30, 2019

Is there an agreement on this one, @efiop ?

@efiop

This comment has been minimized.

@jorgeorpinel

This comment has been minimized.

@efiop

This comment has been minimized.

@jorgeorpinel

This comment has been minimized.

@efiop efiop added p1-important Important, aka current backlog of things to do feature request Requesting a new feature labels Nov 4, 2019
@shcheklein shcheklein removed the feature request Requesting a new feature label Nov 4, 2019
@danihodovic
Copy link
Contributor

danihodovic commented Nov 22, 2019

A few questions regarding this issue:

What protocols should dvc get handle? The examples listed below include both http and ssh. Are there other protocols that need to be considered? Git uses 4 - https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols.

Why dvc get and not dvc import though? What about directory/recursive support for this feature?

Have the questions above by @jorgeorpinel been resolved?

Is it supposed to some sort of compatibility with the future dvc list command? @dmpetrov mentions piping dvc list into dvc get with xargs: #2509 (comment)

Should it be possible to pipe the file to stdout for inspection with a pager such as less? Perhaps when -o is not provided. That would change the existing functionality of dvc get.

As for implementation - could it be as simple as checking if the file provided to dvc get is a not a dvc file and then downloading the file locally? I would inject the logic here:

dvc/dvc/repo/get.py

Lines 21 to 64 in 3de46dd

def get(url, path, out=None, rev=None):
out = resolve_output(path, out)
if Stage.is_valid_filename(out):
raise GetDVCFileError()
# Creating a directory right beside the output to make sure that they
# are on the same filesystem, so we could take the advantage of
# reflink and/or hardlink. Not using tempfile.TemporaryDirectory
# because it will create a symlink to tmpfs, which defeats the purpose
# and won't work with reflink/hardlink.
dpath = os.path.dirname(os.path.abspath(out))
tmp_dir = os.path.join(dpath, "." + str(shortuuid.uuid()))
try:
with external_repo(cache_dir=tmp_dir, url=url, rev=rev) as repo:
# Note: we need to replace state, because in case of getting DVC
# dependency on CIFS or NFS filesystems, sqlite-based state
# will be unable to obtain lock
repo.state = StateNoop()
# Try any links possible to avoid data duplication.
#
# Not using symlink, because we need to remove cache after we are
# done, and to make that work we would have to copy data over
# anyway before removing the cache, so we might just copy it
# right away.
#
# Also, we can't use theoretical "move" link type here, because
# the same cache file might be used a few times in a directory.
repo.cache.local.cache_types = ["reflink", "hardlink", "copy"]
o = repo.find_out_by_relpath(path)
with repo.state:
repo.cloud.pull(o.get_used_cache())
o.path_info = PathInfo(os.path.abspath(out))
with o.repo.state:
o.checkout()
except NotDvcRepoError:
raise UrlNotDvcRepoError(url)
except OutputNotFoundError:
raise OutputNotFoundError(path)
finally:
remove(tmp_dir)
.

Is there a helper to identify dvc files besides just looking at the file suffix?

@efiop
Copy link
Contributor

efiop commented Nov 22, 2019

@danihodovic

What protocols should dvc get handle? The examples listed below include both http and ssh. Are there other protocols that need to be considered? Git uses 4 - https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols.

You mean for dvc gets url argument? Or path argument? Your reference to git makes me think like you are talking about url. If so, we don't really care, as we are passing it to gitpython anyway and whatever it supports - it supports :) But yeah, local, http and ssh are regular ones. But that doesn't really matter for this issue, unless I'm missing something. What would make sense is a path part, and we should only handle local for now, same as for cached arguments. 🙂

@efiop
Copy link
Contributor

efiop commented Nov 22, 2019

@danihodovic

Have the questions above by @jorgeorpinel been resolved?

Import doesn't support regular files IIRC. Cached directories are supported, so it would be great to support non-cached ones too as a part of this ticket 🙂 Recursive option doesn't exist for any of the commands, so it is not part of this ticket (though I'm not even sure what @jorgeorpinel meant by recursive, maybe he could clarify)

Is it supposed to some sort of compatibility with the future dvc list command? @dmpetrov mentions piping dvc list into dvc get with xargs: #2509 (comment)

Well, it is more about dvc list then dvc get. No need to keep dvc list in mind for this issue, unless I'm missing something.

Should it be possible to pipe the file to stdout for inspection with a pager such as less? Perhaps when -o is not provided. That would change the existing functionality of dvc get.

Not required and I don't really see a need for that functionality. Also, how would that even work with directories? 🙂It is not the functionality meant for dvc get.

As for implementation - could it be as simple as checking if the file provided to dvc get is a not a dvc file and then downloading the file locally? I would inject the logic here:
...
Is there a helper to identify dvc files besides just looking at the file suffix?

Not sure what you mean, Dvcfiles are not supplied to dvc get as an argument, output paths are.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Nov 22, 2019

Agree with Ruslan. We would basically be using Git to download files in the repo so we will support everything Git does. I didn't get the part about path though, @efiop. The path is always local/internal to the repo's structure, is that what you mean?

So dvc get just has to detect whether this path represents a file in the repo or a file referenced in a DVC-file in the repo (the latter is what's currently supported) – this should help answer your Q about implementation @danihodovic.

Why dvc get and not dvc import though? What about directory/recursive support for this feature?

These Qs haven't really been decided Dani, thanks for checking. I vote to work on dvc get first but this should also apply to dvc import. I also vote to not complicate things by trying to support directories (recursively or not) for now. We can revisit later.

Cached directories are supported.

What do you mean @efiop? Not sure the cache has much to do with this issue. It's really about files hosted in the Git repo vs. files referenced in the DVC-files hosted in the Git repo, right?

Also agree with Ruslan about dvc list. It's related but separate from this, no need to worry about that rn. Agree as well about no need to pipe the file to stdout.

@efiop
Copy link
Contributor

efiop commented Nov 22, 2019

What do you mean @efiop? Not sure the cache has much to do with this issue. It's really about files hosted in the Git repo vs. files referenced in the DVC-files hosted in the Git repo, right?

I was talking about outputs of dvc files that are not cached by dvc, so possibly tracked by git @jorgeorpinel . E.g. dvc run -O dir 'mkdir dir && echo data > dir/data'(notice the big O). The handling sis pretty much the same, see implementation in api.

@efiop
Copy link
Contributor

efiop commented Nov 22, 2019

Agree with Ruslan. We would basically be using Git to download files in the repo so we will support everything Git does. I didn't get the part about path though, @efiop. The path is always local/internal to the repo's structure, is that what you mean?

Yes, that is what I was saying. Note that we might have external outputs (e.g. s3://bucket/data), but those are not supported and shouldn't be for now, so I was just clarifying.

@jorgeorpinel
Copy link
Contributor Author

Kudos to @danihodovic for PR #2837 which seems to get most of this resolved, even including support for downloading entire directories from Git. Just pending to check whether dvc import will have the same capabilities after that PR (of if we need a follow-up issue for it).

@jorgeorpinel jorgeorpinel changed the title get: allow downloading regular files tracked by Git? get: allow downloading regular files tracked by Git Nov 28, 2019
@jorgeorpinel jorgeorpinel removed the question I have a question? label Nov 28, 2019
@jorgeorpinel jorgeorpinel changed the title get: allow downloading regular files tracked by Git get: allow downloading regular files/dirs tracked by Git Nov 28, 2019
danihodovic added a commit to danihodovic/dvc that referenced this issue Dec 2, 2019
Allows `dvc get` to copy regular files or directories.

fixes: iterative#2515
danihodovic added a commit to danihodovic/dvc that referenced this issue Dec 5, 2019
Allows `dvc get` to copy regular files or directories.

fixes: iterative#2515
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC p1-important Important, aka current backlog of things to do
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants