dvcfs: initial docs #3932

Merged
merged 6 commits into from
Sep 19, 2022
66 changes: 66 additions & 0 deletions content/docs/api-reference/dvcfs.md
@@ -0,0 +1,66 @@
# DvcFileSystem

DvcFileSystem provides a pythonic file interface (aka

Contributor

Maybe it can be further simplified to stress the main points: that it's read-only and fsspec-compatible. I think it should start with something like, "DVCFileSystem is a read-only fsspec-compatible interface." Then you could go on to describe the typical operations and everything else.

Member Author

Initially I wrote it like that, but I later thought that end users do not care about fsspec or need to care about it to use it.

Contributor

The read-only part is important

Contributor

> Initially I wrote it like that, but I later thought that end users do not care about fsspec or need to care about it to use it.

Good point, but if we are not documenting the available methods, we need to point out clearly that it implements the read-only methods available in fsspec.

Contributor

If you want to first focus on it being a filesystem-like interface and read-only, you could say something like: "DvcFileSystem provides read-only operations like ls, du, glob, get, download, etc. for a DVC repo. The API implements most read-only methods of fsspec."

Contributor

> the end users do not care about fsspec or need to care about it to use it.

So why do we even mention it?

Member Author (@skshetry, Sep 14, 2022)

It is an fsspec-based filesystem. We do need to mention it, I just don't think we should lead with that in the intro. 🙂

Contributor

Sounds good. Please see #3932 (comment) below. Thanks

[fsspec](https://filesystem-spec.readthedocs.io/)) for a DVC repo and provides
read-only filesystem-like operations such as `ls`, `du`, `glob`, `get`,
`download`, etc.

DvcFileSystem provides a single view of all the files/directories in your
repository, whether Git-tracked, DVC-tracked, or untracked (in the case of a
local repository). DvcFileSystem is smart enough to reuse the files in your cache directory

Member

Internally it avoids downloading DVC-tracked files unless it's needed. It's using DVC cache [make it abbr] to avoid downloading objects every time and support streaming for supported remote storages [link] for operations like ...

(if present) and can otherwise stream/fetch the files from your default remote.

### Basic Usage

```py
>>> from dvc.fs.dvc import _DvcFileSystem as DvcFileSystem
# opening a local repository
>>> fs = DvcFileSystem("/path/to/local/repository")
# opening a remote repository
>>> remote_fs = DvcFileSystem("https://github.com/iterative/example-get-started.git", rev="main")
```

## Examples

### Listing all DVC-tracked files

Member

suggestion: make title look like command/calls - ls: list DVC-tracked files

Contributor

The snippet here uses find, not ls. The title should probably reflect that

Member Author

Like @daavoo said, it's equivalent to ls --recursive, so I don't want to force commands in the title. I do mention that after the snippets now. :)


```py
>>> fs = DvcFileSystem("https://github.com/iterative/example-get-started.git", rev="main")
# Review comment (Member): may be worth mentioning in these examples the
# important aspects of their behavior, e.g. it doesn't download files, it
# streams, etc. Can be brief.
>>> fs.find("/", detail=False, dvc_only=True)
[
'/data/data.xml',
'/data/features/test.pkl',
'/data/features/train.pkl',
'/data/prepared/test.tsv',
'/data/prepared/train.tsv',
'/evaluation/importance.png',
'/model.pkl'
]
```

### Downloading a file or a directory

```py
>>> fs = DvcFileSystem("https://github.com/iterative/example-get-started.git", rev="main")
>>> fs.get_file("data/data.xml", "data.xml")
```

This downloads the "data/data.xml" file to the current working directory as
"data.xml". DVC-tracked files may be copied from the cache if it exists, or
streamed from the remote otherwise.

```py
>>> fs = DvcFileSystem("https://github.com/iterative/example-get-started.git", rev="main")
>>> fs.get("data", "data", recursive=True)
```

This downloads all the files in the "data" directory, whether Git-tracked or
DVC-tracked, into a local directory "data". As above, DVC might fetch files
from the remote if they don't exist in the cache.

## API Reference

Member Author

I am not really seeing much value from "API reference" right now compared to Scenarios/Recipes that we can cover.

Contributor

Goes back to #3927 (comment). If we have the API ref auto-generated then I agree. Or do you suggest skipping it completely and let people/IDEs check docstrings?

Member

fsspec is supposed to be a documented API that this library implements, we don't want to repeat everything here. Also, it makes sense to start small, with examples. I hope a lot of things here will be self descriptive through examples.


As DvcFileSystem is based on [fsspec](https://filesystem-spec.readthedocs.io/),
it is compatible with most of the APIs that it offers. Please check the fsspec's

Contributor

Suggested change
- it is compatible with most of the APIs that it offers. Please check the fsspec's
+ it is compatible with most of the read-only methods that it offers. Please check the fsspec's

Member Author (@skshetry, Sep 15, 2022)

It's not the API that is read-only, but the filesystem and its operation in general. I have also struggled to explain this, but tbh I don't think we need to mention anything other than "... is a read-only filesystem".

Member Author

I think it's better to even skip it and raise an EROFS.

Contributor

What is an EROFS?

I must be missing your point here because I don't see what's confusing or ambiguous. Is dvcfs compatible with the read-only methods of https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem?

Member Author

EROFS is an errno which is raised for "Read Only FileSystem". What I am trying to say is that the filesystem in general is read-only, not the methods. Take fs.open() for example. It works with mode="rb" but mode="wb" raises EROFS. So I am trying to distinguish between the method open vs the operations: writing/reading, etc.

[API Reference](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem)
for more details.
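
The review thread above distinguishes a read-only filesystem from read-only methods: `fs.open()` works with `mode="rb"`, while `mode="wb"` raises EROFS. A minimal sketch of that convention follows. It uses a hypothetical in-memory `ReadOnlyFS` class that is not part of DVC or fsspec; it only illustrates how a single method can accept reads and reject writes with `errno.EROFS`.

```py
import errno
import io
import os


class ReadOnlyFS:
    """Toy in-memory filesystem (hypothetical, not DVC code) that rejects writes."""

    def __init__(self, files):
        # Mapping of path -> bytes content.
        self._files = dict(files)

    def open(self, path, mode="rb"):
        # Any mode that could create or modify data is refused with EROFS.
        if any(flag in mode for flag in "wax+"):
            raise OSError(errno.EROFS, os.strerror(errno.EROFS), path)
        if path not in self._files:
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
        return io.BytesIO(self._files[path])


fs = ReadOnlyFS({"data/data.xml": b"<data/>"})

# Reading goes through the same open() method and succeeds.
with fs.open("data/data.xml", mode="rb") as f:
    content = f.read()

# Writing goes through the same method but fails with EROFS.
try:
    fs.open("data/data.xml", mode="wb")
except OSError as exc:
    write_errno = exc.errno
```

Callers can check `exc.errno == errno.EROFS` to tell a read-only rejection apart from other I/O errors such as a missing file.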
4 changes: 4 additions & 0 deletions content/docs/sidebar.json
@@ -520,6 +520,10 @@
"label": "Python API Reference",
"source": "api-reference/index.md",
"children": [
{
"slug": "dvcfs",
"label": "DvcFileSystem"
},
{
"slug": "get_url",
"label": "get_url()"