-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
introduce dvc list
command
#3246
Conversation
Hi. Will, |
Hi @jorgeorpinel
If it needs I can add |
@JIoJIaJIu @jorgeorpinel |
7e94ab0
to
60c4fb4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests/unit/command/.test_list.py.swo
looks like some garbage made its way into your patch 🙂
@JIoJIaJIu Please rebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good @JIoJIaJIu ! I've tried running this a few times and it works as requested 🙂 Let's add tests and clean this up.
Btw, it should work with git-only repos as well, since
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest two things to clarify/simplify this:
- do not support multiple targets
- understand target path as relative to repo root, same as in get and import
dvc/repo/ls.py
Outdated
def in_target(out, target_path_info, is_dir): | ||
if out.scheme == "local" and out.path_info == target_path_info: | ||
return True | ||
|
||
if is_dir: | ||
if recursive and out.path_info.isin(target_path_info): | ||
return True | ||
if out.path_info.isin(target_path_info, 1): | ||
return True | ||
|
||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This duplicates Repo.find_outs_by_path(..., strict=False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, one thing I wanted to clarify
dvc list repo target_dir
should return like ?
target_dir
I don't see the use case of such behavior for the user
I think that if user pointes the folder in the target it expect to see the content of it like (without -R
flag)
target_dir/file1
target_dir/subfolder
and when he points -R
flag it means to check subfolder
target_dir/file1
target_dir/subfolder
target_dir/subfolder/file2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good question. But to answer it, I would look into behavior of existing commands. For example ls target_dir
and aws s3 ls s3://bucket/target_dir
will show just target_dir
. So I would stick to that. 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding -R
- please, let's not use it. Let's use a regular ls-like, aws-ls like -r
, --recusrsive
. Along with some flag to filter only outputs.
Also, good question if we want and how do we want to show DVC-files along with the data they reference to.
Also, @JIoJIaJIu could you please share https://asciinema.org/ with different outputs to make it easier to review it? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ls
shows content of the folder by default, but supports -d
option for listing dirs rather then content
aws s3 ls s3://bucket/target_dir
also lists dir and with slash at the end it shows content
aws s3 ls s3://bucket/target_dir/
Based on discord communication we decided to implement showing taget_dir
by default as ls -d
does.
It allows user to check does the folder exists in the repo or not and make the api more simple
$ dvc list url target_dir
target_dir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shcheklein Let's use -R
, same as in other commands. Otherwise it will overlap with -r|--remote
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That cause I've implemented multiple targets support I didn't want to traverse all stages and outs on every target request. So it is a kind of optimization
Let's not optimize it that way then 🙂 Let's only accept one target and use find_outs_by_path
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, yes, I see that ls has -R
indeed. Disregard my comment on this @JIoJIaJIu!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JIoJIaJIu hm, though dvc list url .
is a corner case of that approach 🙂 A natural behavior for .
is to treat it as ./
. So maybe I'm wrong there and we should consider it more carefully. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop
I think that dvc list dir
always should show content of the dir like
$ dvc list dir
file1
subdir
and apply behavior of ls -d
only with target
for checking does it exist or not
$ dvc list dir subdir
subdir
dvc/repo/ls.py
Outdated
result = list(set(result)) | ||
result.sort() | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return sorted(set(result))
Also, why set? Do we expect them to be duplicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check, I remember that faced with duplicated rows, but maybe it was an issue in the logic if outs cannot be duplicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got how it can happen - the output can be fetched/generated and exists on the fs. Hence it will be returned from _ls_files
and from _ls_outs
also which leads to duplicates in the output
On single/multiple targets: we don't have any agreement of how this should look for multiple targets. We opted for implementing a flat list, but that only works for single one. See how |
The last commit contains only changes requested by @efiop , for your notes @Suor I will provide in the next one
I think about targets as a filter that applied on the output, so multiple targets just means multiple filter, similar to
but that cares about folder structure
What do you think? UPD: I decided to get rid of multiple targets support |
for dirpath, dirnames, filenames in tree.walk(target_path): | ||
files.extend(map(lambda f: PathInfo(dirpath, f), filenames)) | ||
if not recursive: | ||
files.extend(map(lambda d: PathInfo(dirpath, d), dirnames)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably the cause of that bug. This line should be outside of if
, right?
It's not a bug but a feature))
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good in general. Tried and works well (see below for an edgecase). I'd like to have a more few tests for external repo (git-only + dvc) with possible combinations such as rev, etc. The tests you have added are good, but mostly covers the local dvc repo cases only. 🙂
dvc/command/ls.py
Outdated
recursive=self.args.recursive, | ||
outputs_only=self.args.outs_only, | ||
) | ||
logger.info("\n".join(nodes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are no nodes, logger.info
will add a newline on the output.
tests/func/test_ls.py
Outdated
m = mocker.patch("dvc.scm.Git.clone") | ||
|
||
def clone_repo(url, to_path, *args): | ||
assert url == "git://github.com:repo" | ||
copy_tree(fspath(tmp_dir), to_path) | ||
return Git(fspath(tmp_dir)) | ||
|
||
m.side_effect = clone_repo | ||
|
||
files = Repo.ls("git://github.com:repo") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to test with external repo, you can use erepo_dir
fixture to create an external dvc repo and git_dir
for external git-only repository. If you want to test with url, you can use file://
based urls as well. No need to mock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if file://
works these days. In any case, the logic still passes this to external_repo
, so we could use local paths here with the same effect. I know that previously we've had a different implementation here for which the url in tests mattered, but it is no longer like that, so we could simplify and use erepo_dir
as @skshetry suggested 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor always adds file://
for external repo in combination with erepo_dir
with# Use URL to prevent any fishy optimizations
kind-of comments. So, that seems to work.
dvc/tests/func/test_external_repo.py
Lines 48 to 50 in 7821fa6
# Use URL to prevent any fishy optimizations | |
url = "file://{}".format(erepo_dir) | |
with external_repo(url) as repo: |
@skshetry Good point! This patch only uses |
I would like to pay attention on the command behaviour when
But when we request
|
@JIoJIaJIu Do you think we should raise an error there? 🙂 |
@efiop I see that we shouldn't. Idea is that target exists into the repo, but this target doesn't contain
The behavior with throwing an error as |
@JIoJIaJIu But I think it is better to illustrate it with an example. Imagine that you are writing a shell script (with
and if there is no
but now your |
dvc/repo/ls.py
Outdated
|
||
if not outs_only: | ||
result.extend(files) | ||
target_path_info = _get_target_path_info(repo, target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure you can move this up and just pass target_path_info instead of target
to other methods. Currently you are calling this helper three times for no reason 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently you are calling this helper three times for no reason
The idea was to keep function interface lightweight and simple - _ls_outs
and _ls_files
requires only repo
which is transparent and make the functions more simple to use. With target_path_info
param I don't like much how functions signature looks
@efiop I've applied the changes. I think it's a philosophy question and a few ways are possible. TLDRAs you know the idea of throwing an error on not existed target is not close to me in general
we've chosen such behavior as a requirement, so the error in the script shows that the requested source is invalid
here, the requested source is valid, but there are no outs there - what is the different case. So if our script should get 1+ outputs on list it should be solved separately and
shouldn't throw if there are no outs on path but path exists |
a5ca20c
to
4b3b5a9
Compare
@JIoJIaJIu Thanks for the explanation! I don't think it justifies the behaviour inconsistency from my example though 🙂 I would understand your approach if we would've stuck to filter-like approach in the base case, where we would not error-out and would simply print nothing, but I think it contradicts itself in its current form as it mismatches the approaches and results in an inconsistency between those two cases. I feel like the misunderstanding here is in the way we understand the ordering of operations here. You seem to be thinking about But let's consider your view. We've established that in default case |
dvc/repo/ls.py
Outdated
result.extend(_ls_files_repo(target_path_info, recursive)) | ||
|
||
if target and not result: | ||
raise PathMissingError(target, repo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: discussing improving this error in --outs-only
case in PMs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good! Let's fix up that exception message and we'll merge and will address the rest later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @JIoJIaJIu ! 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Late to this party but reviewing after the fact given that we'll also take over iterative/dvc.org/pull/967 now. I can probably address my own review here as well but feel free to do so or give any feedback.
- Update also autocomplete scripts and tests.
|
||
|
||
def add_parser(subparsers, parent_parser): | ||
LIST_HELP = "List files and DVC outputs in the repo." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LIST_HELP = "List files and DVC outputs in the repo." | |
List repository contents, including files and directories tracked by DVC and by Git. |
help="Supported urls:\n" | ||
"/path/to/file\n" | ||
"/path/to/directory\n" | ||
"C:\\\\path\\to\\file\n" | ||
"C:\\\\path\\to\\directory\n" | ||
"https://github.com/path/to/repo\n" | ||
"[email protected]:path/to/repo.git\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help="Location of DVC repository to list"
similar to dvc import
. Should be obvious for most people what it means and we'll explain it in docs (again, same as with get/import).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW the args in the -h
output of get and import is slightly different (2 lines vs 1 line):
$ dvc get -h
...
positional arguments:
url Location of DVC project or Git repository to download
from
path Path to a file or directory within the project or
repository
$ dvc import -h
...
positional arguments:
url Location of DVC project or Git repository to download from
path Path to a file or directory within the project or repository
def __init__(self, path, repo): | ||
default_msg = ( | ||
"The path '{}' does not exist in the target repository '{}'" | ||
" neither as an output nor a git-handled file." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"...as a DVC output nor as a Git-tracked file."
) | ||
default_msg_output_only = ( | ||
"The path '{}' does not exist in the target repository '{}'" | ||
" as an output." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as an DVC output
Provides
dvc list
commandUses sparse checkout for remote repos, but missed opportunity to list artifacts in repo by SHA commit (only by refs). Will be fixed in next commitsAs an alternative I looked to the partial-clone but have chosen sparse checkout as the simplest solution, open for suggestions
targets params filter artifacts/outputs and not dvc-file locations
Close #2509
❗ Have you followed the guidelines in the Contributing to DVC list?
📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.
Sufficient test coverage is provided
Support SHA commits as rev
There are shell completion scripts