Merge pull request #908 from iterative/api
api: create docs
jorgeorpinel authored Mar 8, 2020
2 parents c57063f + 7733fea commit 8bba8e8
Showing 16 changed files with 485 additions and 27 deletions.
110 changes: 110 additions & 0 deletions public/static/docs/api-reference/get_url.md
@@ -0,0 +1,110 @@
# dvc.api.get_url()

Returns the URL to the storage location of a data file or directory tracked in a
<abbr>DVC project</abbr>.

```py
def get_url(path: str,
            repo: str = None,
            rev: str = None,
            remote: str = None) -> str
```

#### Usage:

```py
import dvc.api

resource_url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry')

# resource_url is now "https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355"
```

## Description

Returns the URL string of the storage location (in a
[DVC remote](/doc/command-reference/remote)) where a target file or directory,
specified by its `path` in a `repo` (<abbr>DVC project</abbr>), is stored.

The URL is formed by reading the project's
[remote configuration](/doc/command-reference/config#remote) and the
[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is an
<abbr>output</abbr>. The URL scheme returned depends on the
[type](/doc/command-reference/remote/add#supported-storage-types) of the
`remote` used (see the [Parameters](#parameters) section).

If the target is a directory, the returned URL will end in `.dir`. Refer to
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
and `dvc add` to learn more about how DVC handles data directories.

⚠️ This function does not check for the actual existence of the file or
directory in the remote storage.

💡 Having the resource's URL, it should be possible to download it directly with
an appropriate library, such as
[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj)
or
[`paramiko`](https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get).
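
As a rough sketch of that idea: a library such as `boto3` usually expects a bucket name and an object key rather than a full URL, so the string returned by `dvc.api.get_url()` would need to be split first. The helper below does this with the standard library only. Note that `split_s3_url` and the bucket name `my-bucket` are hypothetical, and the sketch assumes the remote is of the S3 type.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3:// URL into (bucket, key). Hypothetical helper."""
    parts = urlparse(url)
    if parts.scheme != "s3":
        raise ValueError(f"not an S3 URL: {url}")
    # netloc holds the bucket; the path (minus its leading slash) is the key.
    return parts.netloc, parts.path.lstrip("/")


# Assumed example value; a real one would come from dvc.api.get_url().
bucket, key = split_s3_url("s3://my-bucket/a3/04afb96060aad90176268345e10355")
print(bucket, key)
```

The resulting pair is the kind of input that the `boto3` download call linked above works with; the download itself is left out here because it needs credentials and network access.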

## Parameters

- **`path`** - location and file name of the file or directory in `repo`,
relative to the project's root.

- `repo` - specifies the location of the DVC project. It can be a URL or a file
system path. Both HTTP and SSH protocols are supported for online Git repos
(e.g. `[user@]server:project.git`). _Default_: The current project is used
(the current working directory tree is walked up to find it).

- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as
a branch or tag name, or a commit hash). If `repo` is not a Git repo, this
option is ignored. _Default_: `HEAD`.

- `remote` - name of the [DVC remote](/doc/command-reference/remote) to use to
form the returned URL string. _Default_: The
[default remote](/doc/command-reference/remote/default) of `repo` is used.

## Exceptions

- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project.

- `dvc.exceptions.NoRemoteError` - no `remote` is found.

## Example: Getting the URL to a DVC-tracked file

```py
import dvc.api

resource_url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
)

print(resource_url)
```

The script above prints

`https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355`

This URL represents the location where the data is stored, and is built by
reading the corresponding DVC-file
([`get-started/data.xml.dvc`](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc))
where the `md5` file hash is stored,

```yaml
outs:
  - md5: a304afb96060aad90176268345e10355
    path: get-started/data.xml
```

and the project configuration
([`.dvc/config`](https://github.com/iterative/dataset-registry/blob/master/.dvc/config))
where the remote URL is saved:

```ini
['remote "storage"']
url = https://remote.dvc.org/dataset-registry
```
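
Putting the two pieces together, the URL in this example looks like the remote URL followed by the hash, split after its first two characters. The sketch below reproduces that pattern for the values shown above. `build_cache_url` is a hypothetical helper, not part of `dvc.api`, and other remote types or DVC versions may lay out their storage differently.

```python
def build_cache_url(remote_url, md5):
    """Join a remote URL and an md5 hash following the pattern in this example."""
    # The first two hex characters become a directory, the rest the file name.
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


url = build_cache_url(
    "https://remote.dvc.org/dataset-registry",  # from .dvc/config
    "a304afb96060aad90176268345e10355",         # from get-started/data.xml.dvc
)
print(url)
# https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
```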
16 changes: 16 additions & 0 deletions public/static/docs/api-reference/index.md
@@ -0,0 +1,16 @@
# Python API

DVC can be used as a Python library: simply [install](/doc/install) it with `pip`
or `conda`. This reference provides the details about the functions in the API
module `dvc.api`, which can be imported in any regular way, for example:

```py
import dvc.api
```

The purpose of this API is to provide programmatic access to the data or models
[stored and versioned](/doc/use-cases/versioning-data-and-model-files) in
<abbr>DVC repositories</abbr> from Python apps.

Please choose a function from the navigation sidebar to the left, or click the
`Next` button below to jump into the first one ↘
189 changes: 189 additions & 0 deletions public/static/docs/api-reference/open.md
@@ -0,0 +1,189 @@
# dvc.api.open()

Opens a tracked file.

```py
def open(path: str,
         repo: str = None,
         rev: str = None,
         remote: str = None,
         mode: str = "r",
         encoding: str = None)
```

#### Usage:

```py
import dvc.api

with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry'
        ) as fd:
    # ... fd is a file object that can be processed normally.
```

## Description

Open a data or model file tracked in a <abbr>DVC project</abbr> and generate a
corresponding
[file object](https://docs.python.org/3/glossary.html#term-file-object). The
file can be tracked by DVC or by Git.

> The exact type of file object depends on the `mode` used. For more details,
> please refer to Python's
> [`open()`](https://docs.python.org/3/library/functions.html#open) built-in,
> which is used under the hood.

`dvc.api.open()` may only be used as a
[context manager](https://www.python.org/dev/peps/pep-0343/#context-managers-in-the-standard-library)
(using the `with` keyword, as shown in the examples).

> Use `dvc.api.read()` to get the complete file contents in a single function
> call – no _context manager_ involved.

This function makes a direct connection to the
[remote storage](/doc/command-reference/remote/add#supported-storage-types),
so the file contents can be streamed as they are read. This means it does not
require space on the disk to save the file before
making it accessible. The only exception is when using Google Drive as
[remote type](/doc/command-reference/remote/add#supported-storage-types).
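
To illustrate what streaming allows, here is a minimal sketch that consumes a binary file object in fixed-size chunks, so the whole file never has to sit in memory or on disk. An `io.BytesIO` object stands in for the file object that `dvc.api.open()` would yield with `mode="rb"`; `md5_of_stream` and the sample bytes are assumptions made for illustration.

```python
import hashlib
import io


def md5_of_stream(fd, chunk_size=8192):
    """Hash a binary file object chunk by chunk, never holding it all in memory."""
    digest = hashlib.md5()
    # iter() with a sentinel keeps calling fd.read() until it returns b"".
    for chunk in iter(lambda: fd.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# io.BytesIO stands in for the object yielded by dvc.api.open(..., mode="rb").
fake_fd = io.BytesIO(b"<root/>" * 10_000)
print(md5_of_stream(fake_fd))
```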

## Parameters

- **`path`** - location and file name of the file in `repo`, relative to the
project's root.

- `repo` - specifies the location of the DVC project. It can be a URL or a file
system path. Both HTTP and SSH protocols are supported for online Git repos
(e.g. `[user@]server:project.git`). _Default_: The current project is used
(the current working directory tree is walked up to find it).

- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as
a branch or tag name, or a commit hash). If `repo` is not a Git repo, this
option is ignored. _Default_: `HEAD`.

- `remote` - name of the [DVC remote](/doc/command-reference/remote) to look for
the target data. _Default_: The
[default remote](/doc/command-reference/remote/default) of `repo` is used if a
`remote` argument is not given. For local projects, the <abbr>cache</abbr> is
tried before the default remote.

- `mode` - specifies the mode in which the file is opened. Defaults to `"r"`
(read). Mirrors the namesake parameter in builtin
[`open()`](https://docs.python.org/3/library/functions.html#open).

- `encoding` -
[codec](https://docs.python.org/3/library/codecs.html#standard-encodings) used
to decode the file contents to a string. This should only be used in text
mode. Defaults to `"utf-8"`. Mirrors the namesake parameter in builtin
`open()`.
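
The difference between the two modes can be sketched with standard-library stand-ins: a binary file object hands back `bytes`, while a text-mode one decodes them with the given codec. The snippet below only imitates that behavior with `io` objects and made-up sample bytes; it is not a description of what DVC does internally.

```python
import io

raw = "слово\n".encode("koi8_r")  # bytes as they might sit in storage

# Binary mode ("rb"): reads return bytes, and no encoding applies.
binary_fd = io.BytesIO(raw)
print(type(binary_fd.read()).__name__)

# Text mode ("r") with an explicit encoding: reads return decoded str.
text_fd = io.TextIOWrapper(io.BytesIO(raw), encoding="koi8_r")
line = text_fd.readline()
print(line.strip())
```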

## Exceptions

- `dvc.exceptions.FileMissingError` - file in `path` is missing from `repo`.

- `dvc.exceptions.PathMissingError` - `path` cannot be found in `repo`.

- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project.

- `dvc.exceptions.NoRemoteError` - no `remote` is found.

## Example: Use data or models from DVC repositories

Any <abbr>data artifact</abbr> can be employed directly in your Python app by
using this API. For example, an XML file tracked in a public DVC repo on GitHub
can be processed with:

```py
from xml.dom.minidom import parse
import dvc.api

with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry'
        ) as fd:
    xmldom = parse(fd)
    # ... Process DOM
```

> Notice that if you just need to load the complete file contents to memory, you
> can use `dvc.api.read()` instead. In the default text mode it returns a
> string, so parse it with `parseString()`:
>
> ```py
> from xml.dom.minidom import parseString
>
> xmldata = dvc.api.read('get-started/data.xml',
>                        repo='https://github.com/iterative/dataset-registry')
> xmldom = parseString(xmldata)
> ```

Now let's imagine you want to deserialize and use a binary model from a private
repo. For a case like this, we can use an SSH URL instead (assuming the
[credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh)
locally):

```py
import pickle
import dvc.api

with dvc.api.open(
        'model.pkl',
        repo='[email protected]:path/to/repo.git',
        mode='rb'  # pickle needs a binary file object
        ) as fd:
    model = pickle.load(fd)
    # ... Use instantiated model
```

## Example: Use different versions of data

The `rev` argument lets you specify any Git commit to look for an artifact. This
way any previous version, or alternative experiment can be accessed
programmatically. For example, let's say your DVC repo has tagged releases of a
CSV dataset:

```py
import csv
import dvc.api

with dvc.api.open(
        'clean.csv',
        rev='v1.1.0'
        ) as fd:
    reader = csv.reader(fd)
    # ... Read clean data from version 1.1.0
```

Also, notice that we didn't supply a `repo` argument in this example. DVC will
attempt to find a <abbr>DVC project</abbr> to use in the current working
directory tree, and look for the file contents of `clean.csv` in its local
<abbr>cache</abbr>; no download will happen if found. See the
[Parameters](#parameters) section for more info.
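
One thing this makes possible is comparing two versions of the same dataset. The sketch below finds the rows that appear only in the newer file. `io.StringIO` objects stand in for the file objects that two `dvc.api.open('clean.csv', rev=...)` calls would yield, and both the helper name `added_rows` and the CSV contents are made up for illustration.

```python
import csv
import io


def added_rows(old_fd, new_fd):
    """Return rows present in the newer CSV but not in the older one."""
    old = {tuple(row) for row in csv.reader(old_fd)}
    return [row for row in csv.reader(new_fd) if tuple(row) not in old]


# Stand-ins for two revisions of the same tracked file.
older = io.StringIO("id,name\n1,ann\n")
newer = io.StringIO("id,name\n1,ann\n2,bob\n")
print(added_rows(older, newer))
```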

Note: to specify the file encoding of a text file, use:

```py
import dvc.api

with dvc.api.open(
        'data/nlp/words_ru.txt',
        encoding='koi8_r') as fd:
    # ...
```

## Example: Choose a specific remote as the data source

Sometimes we may want to choose the [remote](/doc/command-reference/remote) data
source, for example if the `repo` has no default remote set. This can be done by
providing a `remote` argument:

```py
import re

import dvc.api

with dvc.api.open(
        'activity.log',
        repo='location/of/dvc/project',
        remote='my-s3-bucket'
        ) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ...
```
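
What happens inside that loop is up to the application. As one possible continuation, the sketch below tallies how often each user appears. It works on any text file object, so `io.StringIO` with invented log lines stands in for the opened `activity.log`; `count_users` is a hypothetical helper.

```python
import io
import re
from collections import Counter


def count_users(fd):
    """Count occurrences of user=<name> across the lines of a text file object."""
    counts = Counter()
    for line in fd:
        match = re.search(r"user=(\w+)", line)
        if match:  # skip lines without a user field
            counts[match.group(1)] += 1
    return counts


# Made-up log lines; io.StringIO stands in for the opened 'activity.log'.
sample = io.StringIO("user=ann action=login\nheartbeat\nuser=ann action=pull\n")
print(count_users(sample))
```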