diff --git a/public/static/docs/api-reference/get_url.md b/public/static/docs/api-reference/get_url.md new file mode 100644 index 0000000000..9a83e33a09 --- /dev/null +++ b/public/static/docs/api-reference/get_url.md @@ -0,0 +1,110 @@ +# dvc.api.get_url() + +Returns the URL to the storage location of a data file or directory tracked in a +DVC project. + +```py +def get_url(path: str, + repo: str = None, + rev: str = None, + remote: str = None) -> str +``` + +#### Usage: + +```py +import dvc.api + +resource_url = dvc.api.get_url( + 'get-started/data.xml', + repo='https://github.com/iterative/dataset-registry') + +# resource_url is now "https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355" +``` + +## Description + +Returns the URL string of the storage location (in a +[DVC remote](/doc/command-reference/remote)) where a target file or directory, +specified by its `path` in a `repo` (DVC project), is stored. + +The URL is formed by reading the project's +[remote configuration](/doc/command-reference/config#remote) and the +[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is an +output. The URL scheme returned depends on the +[type](/doc/command-reference/remote/add#supported-storage-types) of the +`remote` used (see the [Parameters](#parameters) section). + +If the target is a directory, the returned URL will end in `.dir`. Refer to +[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +and `dvc add` to learn more about how DVC handles data directories. + +⚠️ This function does not check for the actual existence of the file or +directory in the remote storage. 
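As a rough illustration of how such a URL is put together, the sketch below joins a remote URL with an `md5` hash, split after its first two characters. It is based only on the example values used on this page (an HTTP remote) and is not DVC's actual implementation; `build_storage_url` is a hypothetical helper name, not part of `dvc.api`.

```py
def build_storage_url(remote_url: str, md5: str) -> str:
    """Join a remote URL and an md5 hash the way the example URL is laid out.

    Hypothetical helper for illustration only; not part of `dvc.api`.
    """
    # The first two characters of the hash form a directory name,
    # the remaining characters form the file name.
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Example values taken from this page (assumed to be an HTTP remote).
url = build_storage_url(
    "https://remote.dvc.org/dataset-registry",
    "a304afb96060aad90176268345e10355",
)
print(url)
```

Other remote types may lay out their storage differently, so treat this as a reading aid for the example URL rather than a general rule.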
+ +💡 Having the resource's URL, it should be possible to download it directly with +an appropriate library, such as +[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj) +or +[`paramiko`](https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get). + +## Parameters + +- **`path`** - location and file name of the file or directory in `repo`, + relative to the project's root. + +- `repo` - specifies the location of the DVC project. It can be a URL or a file + system path. Both HTTP and SSH protocols are supported for online Git repos + (e.g. `[user@]server:project.git`). _Default_: The current project is used + (the current working directory tree is walked up to find it). + +- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as + a branch or tag name, or a commit hash). If `repo` is not a Git repo, this + option is ignored. _Default_: `HEAD`. + +- `remote` - name of the [DVC remote](/doc/command-reference/remote) to use to + form the returned URL string. _Default_: The + [default remote](/doc/command-reference/remote/default) of `repo` is used. + +## Exceptions + +- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project. + +- `dvc.exceptions.NoRemoteError` - no `remote` is found. 
+ +## Example: Getting the URL to a DVC-tracked file + +```py +import dvc.api + +resource_url = dvc.api.get_url( + 'get-started/data.xml', + repo='https://github.com/iterative/dataset-registry' + ) + +print(resource_url) +``` + +The script above prints + +`https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355` + +This URL represents the location where the data is stored, and is built by +reading the corresponding DVC-file +([`get-started/data.xml.dvc`](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc)) +where the `md5` file hash is stored, + +```yaml +outs: + - md5: a304afb96060aad90176268345e10355 + path: get-started/data.xml +``` + +and the project configuration +([`.dvc/config`](https://github.com/iterative/dataset-registry/blob/master/.dvc/config)) +where the remote URL is saved: + +```ini +['remote "storage"'] +url = https://remote.dvc.org/dataset-registry +``` diff --git a/public/static/docs/api-reference/index.md b/public/static/docs/api-reference/index.md new file mode 100644 index 0000000000..fbeaf79a26 --- /dev/null +++ b/public/static/docs/api-reference/index.md @@ -0,0 +1,16 @@ +# Python API + +DVC can be used as a Python library: simply [install](/doc/install) it with `pip` +or `conda`. This reference provides the details about the functions in the API +module `dvc.api`, which can be imported in any regular way, for example: + +```py +import dvc.api +``` + +The purpose of this API is to provide programmatic access to the data or models +[stored and versioned](/doc/use-cases/versioning-data-and-model-files) in +DVC repositories from Python apps. 
+ +Please choose a function from the navigation sidebar to the left, or click the +`Next` button below to jump into the first one ↘ diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md new file mode 100644 index 0000000000..99602624c7 --- /dev/null +++ b/public/static/docs/api-reference/open.md @@ -0,0 +1,189 @@ +# dvc.api.open() + +Opens a tracked file. + +```py +def open(path: str, + repo: str = None, + rev: str = None, + remote: str = None, + mode: str = "r", + encoding: str = None) +``` + +#### Usage: + +```py +import dvc.api + +with dvc.api.open( + 'get-started/data.xml', + repo='https://github.com/iterative/dataset-registry' + ) as fd: + # ... fd is a file object that can be processed normally. +``` + +## Description + +Open a data or model file tracked in a DVC project and generate a +corresponding +[file object](https://docs.python.org/3/glossary.html#term-file-object). The +file can be tracked by DVC or by Git. + +> The exact type of file object depends on the `mode` used. For more details, +> please refer to Python's +> [`open()`](https://docs.python.org/3/library/functions.html#open) built-in, +> which is used under the hood. + +`dvc.api.open()` may only be used as a +[context manager](https://www.python.org/dev/peps/pep-0343/#context-managers-in-the-standard-library) +(using the `with` keyword, as shown in the examples). + +> Use `dvc.api.read()` to get the complete file contents in a single function +> call – no _context manager_ involved. + +This function makes a direct connection to the +[remote storage](/doc/command-reference/remote/add#supported-storage-types) +(except for Google Drive), so the file contents can be streamed as they are +read. This means it does not require space on the disk to save the file before +making it accessible. The only exception is when using Google Drive as +[remote type](/doc/command-reference/remote/add#supported-storage-types). 
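Because the result is a regular file object, it can be consumed in chunks rather than loaded all at once. The sketch below shows that pattern using only the standard library: an `io.BytesIO` object stands in for the file object that `dvc.api.open(..., mode='rb')` would provide, and `md5_of_stream` is a hypothetical helper name used for illustration.

```py
import hashlib
import io


def md5_of_stream(fd, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a binary file object, chunk by chunk.

    Hypothetical helper for illustration only; not part of `dvc.api`.
    """
    digest = hashlib.md5()
    # read() returns b"" at end of file, which stops the iteration.
    for chunk in iter(lambda: fd.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a file object opened in binary mode.
fake_fd = io.BytesIO(b"hello world")
checksum = md5_of_stream(fake_fd, chunk_size=4)
print(checksum)
```

Only one chunk is held in memory at a time, which is the property that makes streaming useful for large files.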
+ +## Parameters + +- **`path`** - location and file name of the file in `repo`, relative to the + project's root. + +- `repo` - specifies the location of the DVC project. It can be a URL or a file + system path. Both HTTP and SSH protocols are supported for online Git repos + (e.g. `[user@]server:project.git`). _Default_: The current project is used + (the current working directory tree is walked up to find it). + +- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as + a branch or tag name, or a commit hash). If `repo` is not a Git repo, this + option is ignored. _Default_: `HEAD`. + +- `remote` - name of the [DVC remote](/doc/command-reference/remote) to look for + the target data. _Default_: The + [default remote](/doc/command-reference/remote/default) of `repo` is used if a + `remote` argument is not given. For local projects, the cache is + tried before the default remote. + +- `mode` - specifies the mode in which the file is opened. Defaults to `"r"` + (read). Mirrors the namesake parameter in builtin + [`open()`](https://docs.python.org/3/library/functions.html#open). + +- `encoding` - + [codec](https://docs.python.org/3/library/codecs.html#standard-encodings) used + to decode the file contents to a string. This should only be used in text + mode. Defaults to `"utf-8"`. Mirrors the namesake parameter in builtin + `open()`. + +## Exceptions + +- `dvc.exceptions.FileMissingError` - file in `path` is missing from `repo`. + +- `dvc.exceptions.PathMissingError` - `path` cannot be found in `repo`. + +- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project. + +- `dvc.exceptions.NoRemoteError` - no `remote` is found. + +## Example: Use data or models from DVC repositories + +Any data artifact can be employed directly in your Python app by +using this API. 
For example, an XML file tracked in a public DVC repo on GitHub +can be processed directly in your Python app with: + +```py +from xml.dom.minidom import parse +import dvc.api + +with dvc.api.open( + 'get-started/data.xml', + repo='https://github.com/iterative/dataset-registry' + ) as fd: + xmldom = parse(fd) + # ... Process DOM +``` + +> Notice that if you just need to load the complete file contents to memory, you +> can use `dvc.api.read()` instead: +> +> ```py +> xmldata = dvc.api.read('get-started/data.xml', +> repo='https://github.com/iterative/dataset-registry') +> xmldom = parseString(xmldata)  # from xml.dom.minidom import parseString +> ``` + +Now let's imagine you want to deserialize and use a binary model from a private +repo. For a case like this, we can use an SSH URL instead (assuming the +[credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh) +locally): + +```py +import pickle +import dvc.api + +with dvc.api.open( + 'model.pkl', + repo='git@server.com:path/to/repo.git', + mode='rb' + ) as fd: + model = pickle.load(fd) + # ... Use instantiated model +``` + +## Example: Use different versions of data + +The `rev` argument lets you specify any Git commit to look for an artifact. This +way any previous version or alternative experiment can be accessed +programmatically. For example, let's say your DVC repo has tagged releases of a +CSV dataset: + +```py +import csv +import dvc.api + +with dvc.api.open( + 'clean.csv', + rev='v1.1.0' + ) as fd: + reader = csv.reader(fd) + # ... Read clean data from version 1.1.0 +``` + +Also, notice that we didn't supply a `repo` argument in this example. DVC will +attempt to find a DVC project to use in the current working +directory tree, and look for the file contents of `clean.csv` in its local +cache; no download will happen if found. See the +[Parameters](#parameters) section for more info. 
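To picture what "finding a project in the current working directory tree" can look like, here is a minimal sketch that walks up from a starting directory until it meets one that contains a `.dvc` directory. It assumes the project root is marked by that directory (the docs elsewhere refer to `.dvc/config`); `find_project_root` is a hypothetical helper, not DVC's actual lookup code.

```py
import os
import tempfile


def find_project_root(start: str):
    """Return the closest ancestor of `start` (or `start` itself) that
    contains a `.dvc` directory, or None if there is none.

    Hypothetical helper for illustration only; not part of `dvc.api`.
    """
    current = os.path.abspath(start)
    while True:
        if os.path.isdir(os.path.join(current, ".dvc")):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the file system root
            return None
        current = parent


# Demo on a throwaway directory tree: <tmp>/.dvc and <tmp>/data/raw
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, ".dvc"))
    nested = os.path.join(tmp, "data", "raw")
    os.makedirs(nested)
    found = find_project_root(nested)
    print(found == os.path.abspath(tmp))
```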
+ +Note: to specify the file encoding of a text file, use: + +```py +import dvc.api + +with dvc.api.open( + 'data/nlp/words_ru.txt', + encoding='koi8_r') as fd: + # ... +``` + +## Example: Choose a specific remote as the data source + +Sometimes we may want to choose the [remote](/doc/command-reference/remote) data +source, for example if the `repo` has no default remote set. This can be done by +providing a `remote` argument: + +```py +import re +import dvc.api + +with dvc.api.open( + 'activity.log', + repo='location/of/dvc/project', + remote='my-s3-bucket' + ) as fd: + for line in fd: + match = re.search(r'user=(\w+)', line) + # ... +``` diff --git a/public/static/docs/api-reference/read.md b/public/static/docs/api-reference/read.md new file mode 100644 index 0000000000..4fc6def66f --- /dev/null +++ b/public/static/docs/api-reference/read.md @@ -0,0 +1,102 @@ +# dvc.api.read() + +Returns the contents of a tracked file. + +```py +def read(path: str, + repo: str = None, + rev: str = None, + remote: str = None, + mode: str = "r", + encoding: str = None) +``` + +#### Usage: + +```py +import dvc.api + +modelpkl = dvc.api.read( + 'model.pkl', + repo='https://github.com/example/project.git', + mode='rb') +``` + +## Description + +This function wraps [`dvc.api.open()`](/doc/api-reference/open), for a simple +way to return the complete contents of a file tracked in a DVC +project. The file can be tracked by DVC or by Git. + +The returned contents can be a +[string](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) +or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray). + +> The type returned depends on the `mode` used. For more details, please refer +> to Python's [`open()`](https://docs.python.org/3/library/functions.html#open) +> built-in, which is used under the hood. + +> This is similar to the `dvc get` command in our CLI. + +## Parameters + +- **`path`** - location and file name of the file in `repo`, relative to the + project's root. 
+ +- `repo` - specifies the location of the DVC project. It can be a URL or a file + system path. Both HTTP and SSH protocols are supported for online Git repos + (e.g. `[user@]server:project.git`). _Default_: The current project is used + (the current working directory tree is walked up to find it). + +- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as + a branch or tag name, or a commit hash). If `repo` is not a Git repo, this + option is ignored. _Default_: `HEAD`. + +- `remote` - name of the [DVC remote](/doc/command-reference/remote) to look for + the target data. _Default_: The + [default remote](/doc/command-reference/remote/default) of `repo` is used if a + `remote` argument is not given. For local projects, the cache is + tried before the default remote. + +- `mode` - specifies the mode in which the file is opened. Defaults to `"r"` + (read). Mirrors the namesake parameter in builtin + [`open()`](https://docs.python.org/3/library/functions.html#open). + +- `encoding` - + [codec](https://docs.python.org/3/library/codecs.html#standard-encodings) used + to decode the file contents to a string. This should only be used in text + mode. Defaults to `"utf-8"`. Mirrors the namesake parameter in builtin + `open()`. + +## Exceptions + +- `dvc.exceptions.FileMissingError` - file in `path` is missing from `repo`. + +- `dvc.exceptions.PathMissingError` - `path` cannot be found in `repo`. + +- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project. + +- `dvc.exceptions.NoRemoteError` - no `remote` is found. + +## Example: Load data from a DVC repository + +Any data artifact can be employed directly in your Python app by +using this API. 
+ +For example, let's say that you want to deserialize and use a binary model from +an online repo: + +```py +import pickle +import dvc.api + +model = pickle.loads( + dvc.api.read( + 'model.pkl', + repo='https://github.com/example/project.git', + mode='rb' + ) + ) +``` + +> We're using `'rb'` mode here for compatibility with `pickle.loads()`. diff --git a/public/static/docs/command-reference/get-url.md b/public/static/docs/command-reference/get-url.md index 1a45e8c992..4d7fccf6a6 100644 --- a/public/static/docs/command-reference/get-url.md +++ b/public/static/docs/command-reference/get-url.md @@ -3,8 +3,8 @@ Download a file or directory from a supported URL (for example `s3://`, `ssh://`, and other protocols) into the local file system. -> Unlike `dvc import-url`, this command does not track the downloaded data files -> (does not create a DVC-file). +> See `dvc get` to download data/model files or directories from other DVC +> repositories (e.g. hosted on GitHub). ## Synopsis @@ -22,15 +22,15 @@ In some cases it's convenient to get a data artifact from a remote location into the local file system. The `dvc get-url` command helps the user do just that. +> Note that unlike `dvc import-url`, this command does not track the downloaded +> data files (does not create a DVC-file). For that reason, this command doesn't +> require an existing DVC project to run in. + The `url` argument should provide the location of the data to be downloaded, while `out` can be used to specify the directory and/or file name desired for the downloaded data. If an existing directory is specified, then the output will be placed inside of it. -Note that this command doesn't require an existing DVC project to -run in. It's a single-purpose command that can be used out of the box after -installing DVC. 
- DVC supports several types of (local or) remote locations (protocols): | Type | Description | `url` format | @@ -61,9 +61,6 @@ HTTP(S) it's possible to instead use: $ wget https://example.com/path/to/data.csv ``` -> See `dvc get` to download data/model files or directories from other DVC -> repositories (e.g. GitHub URLs). - ## Options - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index 79c8d7e871..5cff7a7bb2 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -3,8 +3,7 @@ Download a file or directory tracked by DVC or by Git into the current working directory. -> Unlike `dvc import`, this command does not track the downloaded files (does -> not create a DVC-file). +> See also our `dvc.api.open()` Python API function. ## Synopsis @@ -21,11 +20,12 @@ positional arguments: Provides an easy way to download files or directories tracked in any DVC repository (e.g. datasets, intermediate results, ML models), or Git repository (e.g. source code, small image/other files). `dvc get` copies the -target file or directory (`url`/`path`) to the current working directory. -(Analogous to `wget`, but for repos.) +target file or directory (found at `path` in `url`) to the current working +directory. (Analogous to `wget`, but for repos.) -Note that this command doesn't require an existing DVC project to run in. It's a -single-purpose command that can be used out of the box after installing DVC. +> Note that unlike `dvc import`, this command does not track the downloaded +> files (does not create a DVC-file). For that reason, this command doesn't +> require an existing DVC project to run in. The `url` argument specifies the address of the DVC or Git repository containing the data source. 
Both HTTP and SSH protocols are supported for online repos diff --git a/public/static/docs/command-reference/import-url.md b/public/static/docs/command-reference/import-url.md index 245928d7b5..bd5b668ae6 100644 --- a/public/static/docs/command-reference/import-url.md +++ b/public/static/docs/command-reference/import-url.md @@ -4,8 +4,8 @@ Download a file or directory from a supported URL (for example `s3://`, `ssh://`, and other protocols) into the workspace, and track changes in the remote data source. Creates a DVC-file. -> See also `dvc get-url`, that corresponds to the first half of what this -> command does (downloading the data artifact). +> See `dvc import` to download and track data/model files or directories from +> other DVC repositories (e.g. hosted on GitHub). ## Synopsis @@ -28,6 +28,9 @@ external data source changes. Example scenarios: - A batch process running regularly updates a data file to import. - A shared dataset on a remote storage that is managed and updated outside DVC. +> Note that `dvc get-url` corresponds to the first step this command performs +> (just download the file or directory). + The `dvc import-url` command helps the user create such an external data dependency. The `url` argument specifies the external location of the data to be imported, while `out` can be used to specify the directory and/or file name @@ -103,9 +106,6 @@ Note that import stages are considered always locked, meaning that if you run `dvc repro`, they won't be updated. Use `dvc update` on them to bring the import up to date from the external data source. -> See `dvc import` to download and tack data/model files or directories from -> other DVC repositories (e.g. GitHub URLs). 
- ## Options - `-f`, `--file` - specify a path and/or file name for the DVC-file created by diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index 62c398b3e1..c8e2443880 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -6,8 +6,7 @@ Download a file or directory tracked by DVC or by Git into the source, which can later be used to [update](/doc/command-reference/update) the import. -> See also `dvc get`, that corresponds to the first step this command performs -> (just download the data). +> See also our `dvc.api.open()` Python API function. ## Synopsis @@ -24,9 +23,13 @@ positional arguments: Provides an easy way to reuse files or directories tracked in any DVC repository (e.g. datasets, intermediate results, ML models) or Git repository (e.g. source code, small image/other files). `dvc import` downloads -the target file or directory (`url`/`path`) in a way so that it's tracked with -DVC, becoming a local data artifact. This also permits updating the -import later, if it has changed in its data source. (See `dvc update`.) +the target file or directory (found at `path` in `url`) in a way so that it's +tracked with DVC, becoming a local data artifact. This also permits +updating the import later, if it has changed in its data source. (See +`dvc update`.) + +> Note that `dvc get` corresponds to the first step this command performs (just +> download the data). The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported for online repos diff --git a/public/static/docs/glossary.js b/public/static/docs/glossary.js index bb6d10bc59..fe982958ac 100644 --- a/public/static/docs/glossary.js +++ b/public/static/docs/glossary.js @@ -16,6 +16,7 @@ code, ML models, etc. It will conatain your DVC project. 
name: 'DVC Project', match: [ 'DVC project', + 'DVC projects', 'project', 'projects', 'DVC repository', diff --git a/public/static/docs/install/index.md b/public/static/docs/install/index.md index c3ed40abb6..c375c0ca61 100644 --- a/public/static/docs/install/index.md +++ b/public/static/docs/install/index.md @@ -7,6 +7,13 @@ Please double check that you don't already have DVC (for example running - [Install on Windows](/doc/install/windows) - [Install on Linux](/doc/install/linux) +## Install as a Python library + +DVC can be used as a Python library: simply install it with a package manager +like `pip` or `conda`, and add it as a Python +[project requirement](https://pip.pypa.io/en/latest/user_guide/#requirements-files) +if needed. The [Python API](/doc/api-reference) module is `dvc.api`. + ## Advanced options - Shell completion is automatically enabled by certain installation methods. If diff --git a/public/static/docs/install/linux.md b/public/static/docs/install/linux.md index 598f73b231..5b5ab6c22d 100644 --- a/public/static/docs/install/linux.md +++ b/public/static/docs/install/linux.md @@ -1,5 +1,8 @@ # Installation on Linux +> To use DVC [as a Python library](/doc/api-reference), please +> [install with pip](#install-with-pip) or [with conda](#install-with-conda). + ## Install with pip > We **strongly** recommend creating a diff --git a/public/static/docs/install/macos.md b/public/static/docs/install/macos.md index d7d9550c9e..3a231e4647 100644 --- a/public/static/docs/install/macos.md +++ b/public/static/docs/install/macos.md @@ -1,5 +1,8 @@ # Installation on MacOS +> To use DVC [as a Python library](/doc/api-reference), please +> [install with pip](#install-with-pip) or [with conda](#install-with-conda). + ## Install with brew Recommended. Requires [Homebrew](https://brew.sh/). 
diff --git a/public/static/docs/install/windows.md b/public/static/docs/install/windows.md index b1b1a1b762..cc2e78e4fe 100644 --- a/public/static/docs/install/windows.md +++ b/public/static/docs/install/windows.md @@ -4,6 +4,11 @@ > [Running DVC on Windows](/doc/user-guide/running-dvc-on-windows) for important > tips to improve your experience using DVC on Windows. + + +> To use DVC [as a Python library](/doc/api-reference), please +> [install with pip](#install-with-pip) or [with conda](#install-with-conda). + ## Windows installer The easiest way is to use the self-contained, executable installer (binary), diff --git a/public/static/docs/sidebar.json b/public/static/docs/sidebar.json index ab11ce0e2b..e808562b35 100644 --- a/public/static/docs/sidebar.json +++ b/public/static/docs/sidebar.json @@ -364,6 +364,25 @@ } ] }, + { + "slug": "api-reference", + "label": "Python API Reference", + "source": "api-reference/index.md", + "children": [ + { + "slug": "get_url", + "label": "get_url()" + }, + { + "slug": "open", + "label": "open()" + }, + { + "slug": "read", + "label": "read()" + } + ] + }, { "slug": "understanding-dvc", "label": "Understanding DVC", diff --git a/public/static/docs/use-cases/data-registries.md b/public/static/docs/use-cases/data-registries.md index 72a425eff4..4cb28cde84 100644 --- a/public/static/docs/use-cases/data-registries.md +++ b/public/static/docs/use-cases/data-registries.md @@ -89,8 +89,8 @@ $ dvc push ## Using registries The main methods to consume data artifacts from a **data registry** -are the `dvc import` and `dvc get` commands, as well as the `dvc.api` Python -API. +are the `dvc import` and `dvc get` commands, as well as the +[`dvc.api`](/doc/api-reference) Python API. 
### Simple download (get) diff --git a/scripts/exclude-links.txt b/scripts/exclude-links.txt index fdfed64b7a..d9f10dcbea 100644 --- a/scripts/exclude-links.txt +++ b/scripts/exclude-links.txt @@ -33,6 +33,9 @@ https://man.dvc.org/foo https://marketplace.visualstudio.com/items?itemName=stkb.rewrap https://myendpoint.com https://object-storage.example.com +https://remote.dvc.org/dataset-registry +https://remote.dvc.org/dataset-registry/a3/04af... +https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355 https://remote.dvc.org/foo/bar https://remote.dvc.org/get-started https://s3-us-east-2.amazonaws.com/dvc-public/code/foo/bar