dvcfs: initial docs #3932

Merged
merged 6 commits into from
Sep 19, 2022
126 changes: 126 additions & 0 deletions content/docs/api-reference/dvcfilesystem.md
@@ -0,0 +1,126 @@
# DVCFileSystem

_New in DVC 2.27_

DVCFileSystem provides a pythonic file interface (
[fsspec-compatible](https://filesystem-spec.readthedocs.io/)) for a DVC repo. It
is a read-only filesystem, hence it does not support any write operations, like
`put_file`, `cp`, `rm`, `mv`, `mkdir` etc.
Comment on lines +5 to +8
Contributor

💅🏼

Suggested change
DVCFileSystem provides a pythonic file interface (
[fsspec-compatible](https://filesystem-spec.readthedocs.io/)) for a DVC repo. It
is a read-only filesystem, hence it does not support any write operations, like
`put_file`, `cp`, `rm`, `mv`, `mkdir` etc.
`DVCFileSystem` is a Python class that provides a file interface for <abbr>DVC repositories</abbr>
(compatible with [fsspec](https://filesystem-spec.readthedocs.io/)). It
is read-only, hence it does not support any write operations like
`put_file`, `cp`, `rm`, `mv`, `mkdir` etc.


DVCFileSystem provides a unified view of all the files/directories in your
repository, be it Git-tracked or DVC-tracked, or untracked (in case of a local
repository). It can reuse the files in DVC <abbr>cache</abbr> and can otherwise
stream from
[supported remote storage](/doc/command-reference/remote/add#supported-storage-types).
Comment on lines +10 to +14
Contributor

Suggested change
DVCFileSystem provides a unified view of all the files/directories in your
repository, be it Git-tracked or DVC-tracked, or untracked (in case of a local
repository). It can reuse the files in DVC <abbr>cache</abbr> and can otherwise
stream from
[supported remote storage](/doc/command-reference/remote/add#supported-storage-types).
`DVCFileSystem` objects construct a unified view of all the files and directories in your
repo, be it Git-tracked, DVC-tracked, or untracked (in the case of local
repos). It can reuse files in the <abbr>DVC cache</abbr>, and it can otherwise
stream from [supported remote storage].
[supported remote storage]: /doc/command-reference/remote/add#supported-storage-types

> It can reuse files in the DVC cache

What does this mean though? Reuse them for what purpose?


```py
>>> from dvc.api import DVCFileSystem
# opening a local repository
>>> fs = DVCFileSystem("/path/to/local/repository")
# opening a remote repository
>>> url = "https://github.com/iterative/example-get-started.git"
>>> fs = DVCFileSystem(url, rev="main")
Comment on lines +16 to +22
Contributor

@jorgeorpinel jorgeorpinel Oct 18, 2022

💅🏼

Suggested change
```py
>>> from dvc.api import DVCFileSystem
# opening a local repository
>>> fs = DVCFileSystem("/path/to/local/repository")
# opening a remote repository
>>> url = "https://github.com/iterative/example-get-started.git"
>>> fs = DVCFileSystem(url, rev="main")
```py
from dvc.api import DVCFileSystem
# Opening a local repository
fs = DVCFileSystem("/path/to/local/repository")
# Opening a remote repository
url = "https://github.com/iterative/example-get-started.git"
fs = DVCFileSystem(url, rev="main")

Applies to other py code blocks.

```

The optional positional argument can be a URL or a local path to the DVC
project. If unspecified, the DVC project in the current working directory is used.
Comment on lines +25 to +26
Contributor

Does it work with any DVC project or does it have to be a Git repo? (We currently use the term "repo" specifically in a few places.)


The optional `rev` argument can be passed to open a filesystem from a certain
Git commit (any [revision](https://git-scm.com/docs/revisions) such as a branch
or a tag name, a commit hash, or an [experiment name]).

[experiment name]: /doc/command-reference/exp/run#-n
Comment on lines +25 to +32
Contributor

Would it be better to put the class signature up somewhere? With param names, etc. (may need to make a mixed definition combined with FileSystem)


## Opening a file

```py
>>> import pickle
>>> with fs.open("model.pkl") as fobj:
...     model = pickle.load(fobj)
```

This is similar to `dvc.api.open()`, which returns a file-like object. Note that,
unlike `dvc.api.open()`, `mode` defaults to binary (`"rb"`). In text mode
(`"r"`), you can also pass an `encoding` argument.
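
A minimal sketch of text mode, assuming `fs` is a `DVCFileSystem` created as
shown earlier. The helper `read_lines` is a hypothetical name used only for
illustration; it passes `mode="r"` and `encoding` through to `fs.open()`.

```python
def read_lines(fs, path, encoding="utf-8"):
    """Open `path` in text mode and return its lines without trailing newlines."""
    with fs.open(path, mode="r", encoding=encoding) as fobj:
        return [line.rstrip("\n") for line in fobj]


# Usage (not run here; needs DVC installed):
# from dvc.api import DVCFileSystem
# fs = DVCFileSystem("/path/to/local/repository")
# lines = read_lines(fs, "README.md")
```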

## Reading a file

```py
>>> contents = fs.cat_file("get-started/data.xml")
```

This is similar to `dvc.api.read()`, but it returns the contents of the file as
bytes instead of a string.
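
If a string is what you need, the bytes can be decoded explicitly. The sketch
below assumes the file is UTF-8 text; `read_text` is a hypothetical helper
name, not a DVC function.

```python
def read_text(fs, path, encoding="utf-8"):
    """Read the whole file via `cat_file()` and decode the bytes to `str`."""
    return fs.cat_file(path).decode(encoding)


# Usage (not run here; needs DVC installed):
# contents = read_text(fs, "get-started/data.xml")
```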

## Listing all DVC-tracked files recursively

```py
>>> fs.find("/", detail=False, dvc_only=True)
[
'/data/data.xml',
'/data/features/test.pkl',
'/data/features/train.pkl',
'/data/prepared/test.tsv',
'/data/prepared/train.tsv',
'/evaluation/importance.png',
'/model.pkl'
]
```

This is similar to the `dvc ls --recursive --dvc-only` CLI command. Note that
`"/"` refers to the root of the Git repo. You can pass a sub-path to return
only the entries under that directory. There is also `fs.ls()`, which is
non-recursive.
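
The sub-path behaviour can be combined with ordinary Python filtering. The
sketch below uses the same `find()` call as the example, restricted to a
sub-path; `dvc_tracked_with_suffix` is a hypothetical helper name.

```python
def dvc_tracked_with_suffix(fs, root, suffix):
    """List DVC-tracked files under `root` whose path ends with `suffix`."""
    paths = fs.find(root, detail=False, dvc_only=True)
    return sorted(p for p in paths if p.endswith(suffix))


# Usage (not run here; needs DVC installed):
# dvc_tracked_with_suffix(fs, "/data", ".pkl")
```

Given the listing shown in the example, this would select the two `.pkl` files
under `/data/features`.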

## Listing all files (including Git-tracked)

```py
>>> fs.find("/", detail=False)
[
...
'/.gitignore',
'/README.md',
'/data/.gitignore',
'/data/data.xml',
'/data/features/test.pkl',
'/data/features/train.pkl',
'/data/prepared/test.tsv',
'/data/prepared/train.tsv',
...
'/evaluation/.gitignore',
'/evaluation/importance.png',
'/evaluation/plots/confusion_matrix.json',
'/evaluation/plots/precision_recall.json',
'/evaluation/plots/roc.json',
'/model.pkl',
...
]
```

This is similar to the `dvc ls --recursive` CLI command. It returns all files
tracked by DVC and Git; if the filesystem is opened locally, it also includes
local untracked files.
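
One way to see which entries are not DVC-tracked is to subtract the `dvc_only`
listing from the full one. This is only a sketch: `non_dvc_files` is a
hypothetical helper, and it cannot tell Git-tracked files apart from untracked
ones.

```python
def non_dvc_files(fs, root="/"):
    """Return files under `root` that are not DVC-tracked.

    The result mixes Git-tracked files and, for a locally opened
    filesystem, untracked files.
    """
    everything = set(fs.find(root, detail=False))
    dvc_tracked = set(fs.find(root, detail=False, dvc_only=True))
    return sorted(everything - dvc_tracked)
```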

## Downloading a file or a directory

```py
>>> fs.get_file("data/data.xml", "data.xml")
```

This downloads the "data/data.xml" file to the current working directory as
"data.xml". DVC-tracked files may be taken from the cache if they exist there,
or streamed from the remote otherwise.

```py
>>> fs.get("data", "data", recursive=True)
```

This downloads all the files in the "data" directory, whether Git-tracked or
DVC-tracked, into a local directory "data". Similarly, DVC might fetch files from
the remote if they don't exist in the cache.
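
A small sketch of wrapping `get_file()` so that the target directory is created
first. `download_into` is a hypothetical helper name, and the sketch assumes
repository paths use `/` as the separator.

```python
import os
import posixpath


def download_into(fs, remote_path, local_dir):
    """Download one file into `local_dir`, keeping its base name."""
    os.makedirs(local_dir, exist_ok=True)
    # Paths inside the repository are "/"-separated regardless of platform.
    local_path = os.path.join(local_dir, posixpath.basename(remote_path))
    fs.get_file(remote_path, local_path)
    return local_path


# Usage (not run here; needs DVC installed):
# download_into(fs, "data/data.xml", "downloads")
```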

## API Reference

Since DVCFileSystem is based on [fsspec](https://filesystem-spec.readthedocs.io/),
it supports most of the APIs that fsspec offers. See fsspec's
[API Reference](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem)
for more details.
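
For instance, the generic fsspec methods `exists()`, `isdir()` and `info()`
can be used to inspect a path. The sketch below is illustrative only:
`describe` is a hypothetical helper, and it assumes the `info()` dictionary
carries a `"size"` key, as is the fsspec convention.

```python
def describe(fs, path):
    """Return a one-line description of `path` using generic fsspec methods."""
    if not fs.exists(path):
        return f"{path}: not found"
    if fs.isdir(path):
        return f"{path}: directory"
    # "size" may be absent for some entries, so use .get() (assumption).
    size = fs.info(path).get("size")
    return f"{path}: file, size={size}"


# Usage (not run here; needs DVC installed):
# print(describe(fs, "model.pkl"))
```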
Comment on lines +121 to +126
Contributor

@jorgeorpinel jorgeorpinel Oct 18, 2022

Last note: we could use terms "properties" and "methods" instead in here, to clarify we mean the reference of this specific class/object, not some larger API.

p.s. we could put this section earlier too, probably before the example sections, even as a note (<admon> or not).

4 changes: 4 additions & 0 deletions content/docs/sidebar.json
@@ -520,6 +520,10 @@
"label": "Python API Reference",
"source": "api-reference/index.md",
"children": [
{
"slug": "dvcfilesystem",
"label": "DVCFileSystem"
},
{
"slug": "get_url",
"label": "get_url()"