diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 9cf5e1b783..fe8bb25d0d 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -83,14 +83,9 @@ "source": "user-guide/index.md", "children": [ { - "slug": "what-is-dvc", "label": "What is DVC?", - "source": "what-is-dvc/index.md", - "children": [ - "collaboration-issues", - "core-features", - "related-technologies" - ] + "slug": "what-is-dvc", + "source": "what-is-dvc.md" }, { "label": "DVC Files and Directories", @@ -134,13 +129,10 @@ "slug": "running-dvc-on-windows" }, "troubleshooting", + "related-technologies", { "label": "Anonymized Usage Analytics", "slug": "analytics" - }, - { - "label": "Privacy Policy (Google APIs)", - "slug": "privacy" } ] }, diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md index 7640d75cb4..879a41f5d8 100644 --- a/content/docs/use-cases/index.md +++ b/content/docs/use-cases/index.md @@ -1,18 +1,26 @@ # Use Cases We provide short articles on common ML workflow or data management scenarios -that DVC can help with or improve. These include the motivating context (usually -extracted from real-life cases); And the approaches to solving them can combine -several features of DVC. Use cases are not written to be run end-to-end. For -more general, hands-on experience with DVC, we recommend following the -[Get Started](/doc/tutorials/get-started), and/or [Tutorials](/doc/tutorials) -first. +that DVC can help with or improve. These include a motivation (usually from +real-life cases), and approaches which combine several features of DVC. Use +cases are not written to be run end-to-end like tutorials. For more general, +hands-on experience with DVC, please see our +[Get Started](/doc/tutorials/get-started) instead. > We keep reviewing our docs and will include interesting scenarios that surface > in the community. Please, [contact us](/support) if you need help or have > suggestions! -## Basic uses +## Why DVC? 
+ +Even with all the success we've seen today in machine learning (ML), especially +with deep learning and its applications in business, the data science community +still lacks good practices for organizing their projects and collaborating +effectively. This is a critical challenge: while ML algorithms and methods are +no longer tribal knowledge, they are still difficult to implement, reuse, and +manage. + +## Basic uses of DVC If you store and process data files or datasets to produce other data or machine learning models, and you want to diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 4bede657db..4ef8955a59 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -14,13 +14,13 @@ This allows easily saving and sharing data alongside code. ![](/img/model-versioning-diagram.png) -In this basic scenario, DVC is a better replacement for `git-lfs` (see -[Related Technologies](/doc/understanding-dvc/related-technologies)) and for -ad-hoc scripts on top of Amazon S3 (or any other cloud) used to manage ML -data artifacts like raw data, models, etc. Unlike `git-lfs`, DVC -doesn't require installing a dedicated server; It can be used on-premises (e.g. -SSH, NAS) or with any major cloud storage provider (Amazon S3, Microsoft Azure -Blob Storage, Google Drive, Google Cloud Storage, etc). +In this basic scenario, DVC is a better replacement for Git-LFS (see +[Related Technologies](/doc/user-guide/related-technologies)) and for ad-hoc +scripts on top of Amazon S3 (or any other cloud) used to manage ML data +artifacts like raw data, models, etc. Unlike Git-LFS, DVC doesn't require +installing a dedicated server; it can be used on-premises (e.g. SSH, NAS) or +with any major cloud storage provider (Amazon S3, Microsoft Azure Blob Storage, +Google Drive, Google Cloud Storage, etc.). 
Let's say you already have a Git repository and put a bunch of images in the `images/` directory, and build a `model.pkl` ML model file using them. diff --git a/content/docs/user-guide/index.md b/content/docs/user-guide/index.md index e51926f40b..06e4116738 100644 --- a/content/docs/user-guide/index.md +++ b/content/docs/user-guide/index.md @@ -1,12 +1,12 @@ # User Guide -Our guides describe the main DVC concepts and features comprehensively, -explaining when and how to use them, as well as connections between them. These -guides don't focus on specific scenarios, but have a general scope – like a user -manual. Their topics range from more technical foundations, impacting more parts -of DVC, to more advanced and specific things you can do. We also include a few -guides related to contributing to -[this open-source project](https://github.com/iterative/dvc). +Our guides describe the major features and concepts of DVC comprehensively, +explaining when and how to use them, as well as the relationships between them. +We don't focus on specific scenarios in this section, but rather on a general +scope. The topics here range from more foundational, impacting more parts of +DVC, to more technical and advanced things you can do. We also include a few +miscellaneous guides, for example on +[contributing to DVC](/doc/user-guide/contributing/core) itself. Please choose from the navigation sidebar to the left, or click the `Next` button below ↘ diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md new file mode 100644 index 0000000000..7bbb8f70f4 --- /dev/null +++ b/content/docs/user-guide/related-technologies.md @@ -0,0 +1,133 @@ +# Comparison with Related Technologies + +DVC combines a number of existing ideas into a single tool, with the goal of +bringing best practices from software engineering into the data science field +(refer to [What is DVC?](/doc/user-guide/what-is-dvc) for more details).
+ +## Git + +- DVC builds upon Git by introducing the concept of data files – large files + that should not be stored in a Git repository, but still need to be tracked + and versioned. It leverages Git's features to enable managing different + versions of data itself, data pipelines, and experiments. + +- DVC is not fundamentally bound to Git, and can work without it (except for + versioning-related features). This also applies to Git-LFS and Git-annex, + below. + +## Git-LFS (Large File Storage) + +- Unlike Git-LFS, DVC does not require special servers. Any cloud storage + like S3, Google Cloud Storage, or even an SSH server can be used as a + [remote storage](/doc/command-reference/remote). No additional databases, + servers, or infrastructure are required. + +- DVC does not add any hooks to the Git repo by default (although they are + [available](/doc/command-reference/install)). + +- Git-LFS was not made with data science in mind, so it doesn't provide related + features (e.g. [pipelines](/doc/command-reference/dag), + [metrics](/doc/command-reference/metrics), etc.). + +- GitHub (the most common Git hosting service) has a limit of 2 GB per repository. + +## Git-annex + +- DVC can use reflinks\* or hardlinks (depending on the system) instead of + symlinks to improve performance and the user experience. + +- Git-annex is a datafile-centric system whereas DVC focuses on providing a + workflow for machine learning and reproducible experiments. When a DVC or + Git-annex repository is cloned via `git clone`, data files won't be copied to + the local machine, as file contents are stored in separate + [remotes](/doc/command-reference/remote). With DVC, however, `.dvc` files, + which provide the reproducible workflow, are always included in the Git + repository. Hence, they can be executed locally with minimal effort. + +- DVC optimizes file hash calculation. + +> \* **copy-on-write links or "reflinks"** are a relatively new way to link +> files in UNIX-style file systems.
Unlike hardlinks or symlinks, they support +> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This +> means that editing a reflinked file is always safe, as the other links to +> the file will not reflect the changes. + +## Git workflows/methodologies such as Gitflow + +- DVC enables a new experimentation methodology that integrates easily with + existing Git workflows. For example, a separate branch can be created for each + experiment, with a subsequent merge of the branch if the experiment is + successful. + +- DVC innovates by giving users the ability to easily navigate through past + experiments without recomputing them each time. + +## Workflow management systems + +Pipelines and dependency graphs +([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) such as _Airflow_, +_Luigi_, etc. + +- DVC is focused on data science and modeling. As a result, DVC pipelines are + lightweight and easy to create and modify. However, DVC lacks advanced + pipeline execution features like execution monitoring, error handling, and + recovery. + +- `dvc` is purely a command line tool without a graphical user interface (GUI) + and doesn't run any daemons or servers. Nevertheless, DVC can generate images + with pipeline and experiment workflow visualizations. + +- See also our sister project, [CML](https://cml.dev/), which helps fill some of + these gaps. + +## Experiment management software + +- DVC uses Git as the underlying layer for data, pipeline, and experiment + versioning, instead of a custom web application. + +- DVC doesn't need to run any services. There's no GUI as a result, but we + expect some GUI services will be created on top of DVC. + +- DVC can generate images with [experiment](/doc/start/experiments) workflow + visualizations. + +- DVC has a transparent design. Its + [internal files and directories](/doc/user-guide/dvc-files-and-directories) + have a human-readable format and can be easily reused by external tools.
+ +## Build automation tools + +[_Make_](https://www.gnu.org/software/make/) and others. + +- File tracking: + + - DVC tracks files based on their hash values (MD5) instead of using + timestamps. This helps avoid running into heavy processes like model + retraining when you check out a previous version of the project (Make would + retrain the model). + + - DVC uses file timestamps and inodes\* for optimization. This allows DVC to + avoid recomputing all dependency file hashes, which would be highly + problematic when working with large files (multiple GB). + +- DVC utilizes a + [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) + (DAG): + + - The DAG or dependency graph is defined implicitly by the connections between + pipeline [stages](/doc/command-reference/run), based on their + dependencies and outputs. + + - Each stage defines one node in the DAG. All DVC-files in a repository make + up a [pipeline](/doc/command-reference/dag) (think a single Makefile). All + stages (and corresponding processes) are implicitly combined through their + inputs and outputs, simplifying conflict resolution during merges. + + - DVC stages can be written manually in an intuitive `dvc.yaml` file, or + generated by the helper command `dvc run`, based on a terminal command, its + inputs, and outputs. + +> \* **Inodes** are file metadata records used to locate the actual file +> contents and store their permissions. See **Linking files** in +> [this doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for +> technical details (Linux). diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md new file mode 100644 index 0000000000..bddf12cc5d --- /dev/null +++ b/content/docs/user-guide/what-is-dvc.md @@ -0,0 +1,48 @@ +# What Is DVC? + +**Data Version Control** is a new type of data versioning, workflow, and +experiment management software that builds upon [Git](https://git-scm.com/) +(although it can work stand-alone).
DVC reduces the gap between established +engineering tool sets and data science needs, allowing users to take advantage +of new [features](#core-features) while reusing existing skills and intuition. + +![](/img/reproducibility.png) _DVC codifies data and ML experiments_ + +Data science experiment sharing and collaboration can be done through a regular +Git flow (commits, branching, pull requests, etc.), the same way it works for +software engineers. + +## Core Features + +- DVC is a [free](https://github.com/iterative/dvc/blob/master/LICENSE), + open-source [command line](/doc/command-reference) tool. + +- DVC works **on top of Git repositories** and has a command line + interface and flow similar to Git's. DVC can also work stand-alone, but without + versioning capabilities. + +- **Data versioning** is enabled by replacing large files, dataset directories, + ML models, etc. with small + [metafiles](/doc/user-guide/dvc-files-and-directories) (easy to handle with + Git). These placeholders point to the original data, which is decoupled from + source code management. + +- **Data storage**: On-premises or cloud storage can be used to store the + project's data separate from its code base. This is how data scientists can + transfer large datasets or share a GPU-trained model with others. + +- DVC makes data science projects **reproducible** by creating lightweight + [pipelines](/doc/command-reference/dag) using implicit dependency graphs, and + codifying the data and artifacts involved. + +- DVC is **platform agnostic**: It runs on all major operating systems (Linux, + macOS, and Windows), and works independently of the programming languages + (Python, R, Julia, shell scripts, etc.) or ML libraries (Keras, TensorFlow, + PyTorch, SciPy, etc.) used in the project. + +- **Easy to use**: DVC is quick to [install](/doc/install) and doesn't require + special infrastructure, nor does it depend on APIs or external services. It's + a stand-alone CLI tool.
+ + > Git servers, as well as SSH and cloud storage providers are supported, + > however. diff --git a/content/docs/user-guide/what-is-dvc/collaboration-issues.md b/content/docs/user-guide/what-is-dvc/collaboration-issues.md deleted file mode 100644 index bac66a65dc..0000000000 --- a/content/docs/user-guide/what-is-dvc/collaboration-issues.md +++ /dev/null @@ -1,53 +0,0 @@ -# Collaboration Issues in Data Science - -Even with all the success we've seen today in machine learning (ML), -specifically deep learning and its applications in business, the data science -community still lacks good practices for organizing their projects and -effectively collaborating across their varied ML projects. This is a critical -challenge: we need to evolve towards ML algorithms and methods no longer being -tribal knowledge and making them easy to implement, reuse, and manage. - -To make progress, many areas of the ML experimentation process need to be -formalized. Common questions need to be answered in an unified, principled way. - -## Questions - -### Source code and data versioning - -- How do you avoid discrepancies between - [revisions](https://git-scm.com/docs/revisions) of source code and versions of - data files, when the data cannot fit into a traditional repository? - -### Experiment time log - -- How do you track which of your - [hyperparameter]() - changes contributed the most to producing or improving your target - [metric](/doc/command-reference/metrics)? How do you monitor the degree of - each change? - -### Navigating through experiments - -- How do you recover a model from last week without wasting time waiting for the - model to retrain? - -- How do you quickly switch between a large dataset and a small subset without - modifying source code? - -### Reproducibility - -- How do you run a model's evaluation process again without retraining the model - and preprocessing a raw dataset? 
- -### Managing and sharing large data files - -- How do you share models trained in a GPU environment with colleagues who don't - have access to a GPU? - -- How do you share the entire 147 GB of your ML project, with all of its data - sources, intermediate data files, and models? - -Some of these questions are easy to answer individually. Data scientists, -engineers, or managers may already knows or can easily find answers to some of -them. However, the variety of answers and approaches makes data science -collaboration a nightmare. **A systematic approach is required.** diff --git a/content/docs/user-guide/what-is-dvc/core-features.md b/content/docs/user-guide/what-is-dvc/core-features.md deleted file mode 100644 index 5960a1feed..0000000000 --- a/content/docs/user-guide/what-is-dvc/core-features.md +++ /dev/null @@ -1,20 +0,0 @@ -# Core Features - -- DVC works **on top of Git repositories** and has a similar command line - interface and Git workflow. - -- It makes data science projects **reproducible** by creating lightweight - [pipelines](/doc/command-reference/dag) using implicit dependency graphs. - -- **Large data file versioning** works by creating special files in your Git - repository that point to the cache, typically stored on a local - hard drive. - -- DVC is **Programming language agnostic**: Python, R, Julia, shell scripts, - etc. as well as ML library agnostic: Keras, Tensorflow, PyTorch, Scipy, etc. - -- It's **Open-source** and **Self-serve**: DVC is free and doesn't require any - additional services. - -- DVC supports cloud storage (Amazon S3, Microsoft Azure Blob Storage, Google - Cloud Storage, etc.) for **data sources and pre-trained model sharing**. diff --git a/content/docs/user-guide/what-is-dvc/index.md b/content/docs/user-guide/what-is-dvc/index.md deleted file mode 100644 index 752dbfbe77..0000000000 --- a/content/docs/user-guide/what-is-dvc/index.md +++ /dev/null @@ -1,68 +0,0 @@ -# What Is DVC? 
- -Today the data science community is still lacking good practices for organizing -their projects and effectively collaborating. ML algorithms and methods are no -longer simple tribal knowledge but are still difficult to implement, manage and -reuse. - -> One of the biggest challenges in reusing, and hence the managing of ML -> projects, is its reproducibility. - -Data Version Control, or DVC, is a new type of experiment management software -built on top of Git. DVC reduces the gap between existing tools and data science -needs, allowing users to take advantage of experiment management while reusing -existing skills and intuition. - -![](/img/reproducibility.png)_DVC codifies data and ML experiments_ - -Leveraging an underlying source code management system eliminates the need to -use 3rd-party services. Data science experiment sharing and collaboration can be -done through regular Git features (commit messages, merges, pull requests, etc) -the same way it works for software engineers. - -DVC uses a few core concepts: - -- **Experiment**: Equivalent to a - [Git revision](https://git-scm.com/docs/revisions). Each experiment (extract - new features, change model hyperparameters, data cleaning, add a new data - source) can be performed in a separate branch or tag. DVC allows experiments - to be integrated into a Git repository history and never needs to recompute - the results after a successful merge. - -- **Experiment state** or state: Equivalent to a Git snapshot (all committed - files). A Git commit hash, branch or tag name, etc. can be used as a - [reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) to an - experiment state. - -- **Reproducibility**: Action to reproduce an experiment state. This action - generates output files (or directories) based on a set of input files and - source code. This action usually changes experiment state. - -- **Pipeline**: Dependency graph or series of commands to reproduce data - processing results. 
The commands are connected by their inputs - (dependencies) and outputs. Pipelines are defined by - special [stage files](/doc/command-reference/run) (similar to - [Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction)). - Refer to [pipeline](/doc/command-reference/dag) for more information. - -- **Workflow**: Set of experiments and relationships among them. Workflow - corresponds to the entire Git repository. - -- **Data files**: Cached files (for large files). Data files are stored outside - of the Git repository on a local/shared hard drive or remote storage, but - [DVC-files](/doc/user-guide/dvc-files-and-directories) describing that data - are stored in Git for DVC needs (to maintain pipelines and reproducibility). - -- **Cache directory**: Directory with all data files on a local hard drive or in - cloud storage, but not in the Git repository. See `dvc cache dir`. - -- **Cloud storage** support: available complement to the core DVC features. This - is how a data scientist transfers large data files or shares a GPU-trained - model with those without GPUs available. - -DVC streamlines large data files and binary models into a single Git environment -and this approach will not require storing binary files in your Git repository. -The diagram below describes all the DVC commands and relationships between a -local cache and remote storage: - -![](/img/flow-large.png)_DVC data management_ diff --git a/content/docs/user-guide/what-is-dvc/related-technologies.md b/content/docs/user-guide/what-is-dvc/related-technologies.md deleted file mode 100644 index 86049754fa..0000000000 --- a/content/docs/user-guide/what-is-dvc/related-technologies.md +++ /dev/null @@ -1,141 +0,0 @@ -# Comparison to Existing Technologies - -DVC takes a novel approach, and it may be easier to understand DVC in comparison -to existing technologies and tools. 
- -DVC combines a number of existing ideas into a single product, with the goal of -bringing best practices from software engineering into the data science field. - -## Differences with related tools - -### Git - -- DVC extends Git by introducing the concept of _data files_ – large files that - should NOT be stored in a Git repository but still need to be tracked and - versioned. - -### Workflow management tools - -Pipelines and dependency graphs -([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) such as Airflow, -Luigi, etc. - -- DVC is focused on data science and modeling. As a result, DVC pipelines are - lightweight and easy to create and modify. However, DVC lacks pipeline - execution features like execution monitoring, execution error handling, and - recovering. - -- DVC is purely a command line tool without a graphical user interface (GUI) and - doesn't run any daemons or servers. Nevertheless, DVC can generate images with - pipeline and experiment workflow visualizations. - -### Experiment management software - -- DVC uses Git as the underlying platform for experiment tracking instead of a - web application. - -- DVC doesn't need to run any services. There's no graphical user interface as a - result, but we expect some GUI services will be created on top of DVC. - -- DVC has transparent design. Its - [files and directories](/doc/user-guide/dvc-files-and-directories) (including - the cache directory) have a human-readable format and can be - easily reused by external tools. - -### Git workflows/methodologies such as Gitflow - -- DVC supports a new experimentation methodology that integrates easily with a - Git workflow. A separate branch can be created for each experiment, with a - subsequent merge of the branch if the experiment was successful. - -- DVC innovates by giving experimenters the ability to easily navigate through - past experiments without recomputing them each time. 
- -### Build automation tools - -[Make](https://www.gnu.org/software/make/) and others. - -- DVC utilizes a - [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) - (DAG): - - - The DAG or dependency graph is defined implicitly by the connections between - [DVC-files](/doc/user-guide/dvc-files-and-directories) (with file names - `.dvc`), based on their dependencies and outputs. - - - Each DVC-file defines one node in the DAG. All DVC-files in a repository - make up a single pipeline (think a single Makefile). All DVC-files (and - corresponding pipeline commands) are implicitly combined through their - inputs and outputs, simplifying conflict resolution during merges. - - - DVC provides a simple command – `dvc run` – to generate a DVC-file or "stage - file" automatically, based on the provided command, dependencies, and - outputs. - -- File tracking: - - - DVC tracks files based on their hashes (MD5) instead of file timestamps. - This helps avoid running into heavy processes like model retraining when you - checkout a previously trained version of a model (Make would retrain the - model). - - - DVC uses file timestamps and inodes for optimization. This allows DVC to - avoid recomputing all dependency file hashes, which would be highly - problematic when working with large files (10 GB+). - -### Git-annex - -- DVC uses the idea of storing the content of large files (that you don't want - to see in your Git repository) in a local key-value store and uses file - symlinks instead of the actual files. - -- DVC can use reflinks\* or hardlinks (depending on the system) instead of - symlinks to improve performance and the user experience. - -- DVC optimizes file hash calculation. - -- Git-annex is a datafile-centric system whereas DVC is focused on providing a - workflow for machine learning and reproducible experiments. 
When a DVC or - Git-annex repository is cloned via `git clone`, data files won't be copied to - the local machine, as file contents are stored in separate - [remotes](/doc/command-reference/remote). With DVC, - [DVC-files](/doc/user-guide/dvc-files-and-directories), which provide the - reproducible workflow, are always included in the Git repository. Hence, they - can be executed locally with minimal effort. - -- DVC is not fundamentally bound to Git, and users have the option of using DVC - without Git. - -### Git-LFS (Large File Storage) - -- DVC does not require special Git servers like Git-LFS demands. Any cloud - storage like S3, GCS, or an on-premises SSH server can be used as a backend - for datasets and models. No additional databases, servers, or infrastructure - are required. - -- DVC is not fundamentally bound to Git, and users have the option of using DVC - without Git. - -- DVC does not add any hooks to the Git repo by default. To checkout data files, - the `dvc checkout` command has to be run after each `git checkout` and - `git clone` command. It provides control for managing data and code - separately. Hooks could be configured to make workflows simpler. - -- DVC attempts to use reflinks\* and has other - [file linking options](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache). - This way the `dvc checkout` command does not actually copy data files from - cache to the workspace, as copying files is a heavy - operation for large files (30 GB+). - -- `git-lfs` was not made with data science scenarios in mind, so it does not - provide related features (e.g. pipelines, - [metrics](/doc/command-reference/metrics)), and thus Github has a limit of 2 - GB per repository. - ---- - -> \* **copy-on-write links or "reflinks"** are a relatively new way to link -> files in UNIX-style file systems. Unlike hardlinks or symlinks, they support -> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). 
This -> means that editing a reflinked file is always safe as all the other links to -> the file will reflect the changes. diff --git a/static/img/flow-large.png b/static/img/flow-large.png deleted file mode 100644 index 177108e6ec..0000000000 Binary files a/static/img/flow-large.png and /dev/null differ