From 253f87d8f1314b6f94f6627ac0848470a967aaba Mon Sep 17 00:00:00 2001
From: V Abhijith Rao
Date: Tue, 19 May 2020 20:36:31 +0530
Subject: [PATCH 1/2] Absorbing Understanding DVC

I have deleted most of the content regarding the explanation of core
concepts and features, and merged the more important parts of this
section into the user guide and the getting started section. Please let
me know if there are further changes I should make.
---
 .../understanding-dvc/collaboration-issues.md |  53 -------
 .../docs/understanding-dvc/core-features.md   |  20 ---
 .../docs/understanding-dvc/existing-tools.md  |  34 -----
 .../understanding-dvc/related-technologies.md | 141 ------------------
 content/docs/understanding-dvc/what-is-dvc.md |  57 -------
 content/docs/use-cases/index.md               |   9 ++
 .../how-it-works.md                           |   0
 content/docs/user-guide/index.md              |  16 ++
 .../resources.md                              |   0
 9 files changed, 25 insertions(+), 305 deletions(-)
 delete mode 100644 content/docs/understanding-dvc/collaboration-issues.md
 delete mode 100644 content/docs/understanding-dvc/core-features.md
 delete mode 100644 content/docs/understanding-dvc/existing-tools.md
 delete mode 100644 content/docs/understanding-dvc/related-technologies.md
 delete mode 100644 content/docs/understanding-dvc/what-is-dvc.md
 rename content/docs/{understanding-dvc => user-guide}/how-it-works.md (100%)
 rename content/docs/{understanding-dvc => user-guide}/resources.md (100%)

diff --git a/content/docs/understanding-dvc/collaboration-issues.md b/content/docs/understanding-dvc/collaboration-issues.md
deleted file mode 100644
index bac66a65dc..0000000000
--- a/content/docs/understanding-dvc/collaboration-issues.md
+++ /dev/null
@@ -1,53 +0,0 @@
-# Collaboration Issues in Data Science
-
-Even with all the success we've seen today in machine learning (ML),
-specifically deep learning and its applications in business, the data science
-community still lacks good practices for organizing their projects and
-effectively collaborating across their varied ML projects. This is a critical
-challenge: we need to evolve towards ML algorithms and methods no longer being
-tribal knowledge and making them easy to implement, reuse, and manage.
-
-To make progress, many areas of the ML experimentation process need to be
-formalized. Common questions need to be answered in an unified, principled way.
-
-## Questions
-
-### Source code and data versioning
-
-- How do you avoid discrepancies between
-  [revisions](https://git-scm.com/docs/revisions) of source code and versions of
-  data files, when the data cannot fit into a traditional repository?
-
-### Experiment time log
-
-- How do you track which of your
-  [hyperparameter]()
-  changes contributed the most to producing or improving your target
-  [metric](/doc/command-reference/metrics)? How do you monitor the degree of
-  each change?
-
-### Navigating through experiments
-
-- How do you recover a model from last week without wasting time waiting for the
-  model to retrain?
-
-- How do you quickly switch between a large dataset and a small subset without
-  modifying source code?
-
-### Reproducibility
-
-- How do you run a model's evaluation process again without retraining the model
-  and preprocessing a raw dataset?
-
-### Managing and sharing large data files
-
-- How do you share models trained in a GPU environment with colleagues who don't
-  have access to a GPU?
-
-- How do you share the entire 147 GB of your ML project, with all of its data
-  sources, intermediate data files, and models?
-
-Some of these questions are easy to answer individually. Data scientists,
-engineers, or managers may already knows or can easily find answers to some of
-them. However, the variety of answers and approaches makes data science
-collaboration a nightmare. **A systematic approach is required.**
diff --git a/content/docs/understanding-dvc/core-features.md b/content/docs/understanding-dvc/core-features.md
deleted file mode 100644
index 44dee4ad2e..0000000000
--- a/content/docs/understanding-dvc/core-features.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# Core Features
-
-- DVC works **on top of Git repositories** and has a similar command line
-  interface and Git workflow.
-
-- It makes data science projects **reproducible** by creating lightweight
-  [pipelines](/doc/command-reference/pipeline) using implicit dependency graphs.
-
-- **Large data file versioning** works by creating special files in your Git
-  repository that point to the cache, typically stored on a local
-  hard drive.
-
-- DVC is **Programming language agnostic**: Python, R, Julia, shell scripts,
-  etc. as well as ML library agnostic: Keras, Tensorflow, PyTorch, Scipy, etc.
-
-- It's **Open-source** and **Self-serve**: DVC is free and doesn't require any
-  additional services.
-
-- DVC supports cloud storage (Amazon S3, Microsoft Azure Blob Storage, Google
-  Cloud Storage, etc.) for **data sources and pre-trained model sharing**.
diff --git a/content/docs/understanding-dvc/existing-tools.md b/content/docs/understanding-dvc/existing-tools.md
deleted file mode 100644
index 279434f94a..0000000000
--- a/content/docs/understanding-dvc/existing-tools.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# Tools for Data Scientists
-
-## Existing engineering tools
-
-There is one thing that data scientists seem to agree on around tooling: as
-engineers, we'd like to use the same best practices and collaboration software
-that's standard in software engineering. A source code version control system
-(Git), continuous integration services (CI), and unit test frameworks are all
-expected to be utilized in data science
-[pipelines](/doc/command-reference/pipeline).
-
-But a comprehensive look at data science processes shows that the software
-engineering toolset does not completely cover data science needs. Try to answer
-all the questions from the above using only engineering tools, and you're likely
-to be left wanting more.
-
-## Experiment management software
-
-This new type of software was created to solve data science collaboration
-issues. Experiment management software aims to cover the gap between data
-scientist needs and the existing toolset from software engineering.
-
-Experiment management software is usually **graphical user interface** (GUI)
-based, in contrast to existing command line engineering tools. The GUI is a
-bridge to a separate **cloud based environment**. The cloud environment is
-usually not as flexible as local data scientist environments, and isn't fully
-integrated with local environments either.
-
-The separation of the local data scientist environment and the experimentation
-cloud environment creates another discrepancy issue, and environment
-synchronization requires addition work. Also, this style of software usually
-requires external services that aren't free. This might be a good solution for a
-particular companies or groups of data scientists. but a more accessible, free
-tool is needed for a wider audience.
diff --git a/content/docs/understanding-dvc/related-technologies.md b/content/docs/understanding-dvc/related-technologies.md
deleted file mode 100644
index d2b3729e8b..0000000000
--- a/content/docs/understanding-dvc/related-technologies.md
+++ /dev/null
@@ -1,141 +0,0 @@
-# Comparison to Existing Technologies
-
-DVC takes a novel approach, and it may be easier to understand DVC in comparison
-to existing technologies and tools.
-
-DVC combines a number of existing ideas into a single product, with the goal of
-bringing best practices from software engineering into the data science field.
-
-## Differences with related tools
-
-### Git
-
-- DVC extends Git by introducing the concept of _data files_ – large files that
-  should NOT be stored in a Git repository but still need to be tracked and
-  versioned.
-
-### Workflow management tools
-
-Pipelines and dependency graphs
-([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) such as Airflow,
-Luigi, etc.
-
-- DVC is focused on data science and modeling. As a result, DVC pipelines are
-  lightweight and easy to create and modify. However, DVC lacks pipeline
-  execution features like execution monitoring, execution error handling, and
-  recovering.
-
-- DVC is purely a command line tool without a graphical user interface (GUI) and
-  doesn't run any daemons or servers. Nevertheless, DVC can generate images with
-  pipeline and experiment workflow visualizations.
-
-### Experiment management software
-
-- DVC uses Git as the underlying platform for experiment tracking instead of a
-  web application.
-
-- DVC doesn't need to run any services. There's no graphical user interface as a
-  result, but we expect some GUI services will be created on top of DVC.
-
-- DVC has transparent design. Its
-  [internal files and directories](/doc/user-guide/dvc-files-and-directories)
-  (including the cache directory) have a human-readable format and
-  can be easily reused by external tools.
-
-### Git workflows/methodologies such as Gitflow
-
-- DVC supports a new experimentation methodology that integrates easily with a
-  Git workflow. A separate branch can be created for each experiment, with a
-  subsequent merge of the branch if the experiment was successful.
-
-- DVC innovates by giving experimenters the ability to easily navigate through
-  past experiments without recomputing them each time.
-
-### Build automation tools
-
-[Make](https://www.gnu.org/software/make/) and others.
-
-- DVC utilizes a
-  [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph)
-  (DAG):
-
-  - The DAG or dependency graph is defined implicitly by the connections between
-    [DVC-files](/doc/user-guide/dvc-file-format) (with file names `.dvc`
-    or `Dvcfile`), based on their dependencies and outputs.
-
-  - Each DVC-file defines one node in the DAG. All DVC-files in a repository
-    make up a single pipeline (think a single Makefile). All DVC-files (and
-    corresponding pipeline commands) are implicitly combined through their
-    inputs and outputs, simplifying conflict resolution during merges.
-
-  - DVC provides a simple command – `dvc run` – to generate a DVC-file or "stage
-    file" automatically, based on the provided command, dependencies, and
-    outputs.
-
-- File tracking:
-
-  - DVC tracks files based on their hashes (MD5) instead of file timestamps.
-    This helps avoid running into heavy processes like model retraining when you
-    checkout a previously trained version of a model (Make would retrain the
-    model).
-
-  - DVC uses file timestamps and inodes for optimization. This allows DVC to
-    avoid recomputing all dependency file hashes, which would be highly
-    problematic when working with large files (10 GB+).
-
-### Git-annex
-
-- DVC uses the idea of storing the content of large files (that you don't want
-  to see in your Git repository) in a local key-value store and uses file
-  symlinks instead of the actual files.
-
-- DVC can use reflinks\* or hardlinks (depending on the system) instead of
-  symlinks to improve performance and the user experience.
-
-- DVC optimizes file hash calculation.
-
-- Git-annex is a datafile-centric system whereas DVC is focused on providing a
-  workflow for machine learning and reproducible experiments. When a DVC or
-  Git-annex repository is cloned via `git clone`, data files won't be copied to
-  the local machine, as file contents are stored in separate
-  [remotes](/doc/command-reference/remote). With DVC,
-  [DVC-files](/doc/user-guide/dvc-file-format), which provide the reproducible
-  workflow, are always included in the Git repository. Hence, they can be
-  executed locally with minimal effort.
-
-- DVC is not fundamentally bound to Git, and users have the option of using DVC
-  without SCM.
-
-### Git-LFS (Large File Storage)
-
-- DVC does not require special Git servers like Git-LFS demands. Any cloud
-  storage like S3, GCS, or an on-premises SSH server can be used as a backend
-  for datasets and models. No additional databases, servers, or infrastructure
-  are required.
-
-- DVC is not fundamentally bound to Git, and users have the option of using DVC
-  without SCM.
-
-- DVC does not add any hooks to the Git repo by default. To checkout data files,
-  the `dvc checkout` command has to be run after each `git checkout` and
-  `git clone` command. It gives more granularity on managing data and code
-  separately. Hooks could be configured to make workflows simpler.
-
-- DVC attempts to use reflinks\* and has other
-  [file linking options](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).
-  This way the `dvc checkout` command does not actually copy data files from
-  cache to the workspace, as copying files is a heavy
-  operation for large files (30 GB+).
-
-- `git-lfs` was not made with data science scenarios in mind, so it does not
-  provide related features (e.g. pipelines,
-  [metrics](/doc/command-reference/metrics)), and thus GitHub has a limit of 2
-  GB per repository.
-
----
-
-> \***copy-on-write links or "reflinks"** are a relatively new way to link files
-> in UNIX-style file systems. Unlike hardlinks or symlinks, they support
-> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This
-> means that editing a reflinked file is always safe as all the other links to
-> the file will reflect the changes.
diff --git a/content/docs/understanding-dvc/what-is-dvc.md b/content/docs/understanding-dvc/what-is-dvc.md
deleted file mode 100644
index 39aab4b8e9..0000000000
--- a/content/docs/understanding-dvc/what-is-dvc.md
+++ /dev/null
@@ -1,57 +0,0 @@
-# What Is DVC?
-
-Data Version Control, or DVC, is **a new type of experiment management
-software** that has been built **on top of the existing engineering toolset that
-you're already used to**, and particularly on a source code version control
-system (currently Git). DVC reduces the gap between existing tools and data
-science needs, allowing users to take advantage of experiment management
-software while reusing existing skills and intuition.
-
-The underlying source code control system eliminates the need to use external
-services. Data science experiment sharing and collaboration can be done through
-regular Git tools (commit messages, merges, pull requests, etc) the same way it
-works for software engineers.
-
-DVC implements a **Git experimentation methodology** where each experiment
-exists with its code as well as data, and can be represented as a separate Git
-branch or commit.
-
-DVC uses a few core concepts:
-
-- **Experiment**: Equivalent to a
-  [Git revision](https://git-scm.com/docs/revisions). Each experiment (extract
-  new features, change model hyperparameters, data cleaning, add a new data
-  source) can be performed in a separate branch or tag. DVC allows experiments
-  to be integrated into a Git repository history and never needs to recompute
-  the results after a successful merge.
-
-- **Experiment state** or state: Equivalent to a Git snapshot (all committed
-  files). A Git commit hash, branch or tag name, etc. can be used as a
-  [reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) to an
-  experiment state.
-
-- **Reproducibility**: Action to reproduce an experiment state. This action
-  generates output files (or directories) based on a set of input files and
-  source code. This action usually changes experiment state.
-
-- **Pipeline**: Dependency graph or series of commands to reproduce data
-  processing results. The commands are connected by their inputs
-  (dependencies) and outputs. Pipelines are defined by
-  special [stage files](/doc/command-reference/run) (similar to
-  [Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction)).
-  Refer to [pipeline](/doc/command-reference/pipeline) for more information.
-
-- **Workflow**: Set of experiments and relationships among them. Workflow
-  corresponds to the entire Git repository.
-
-- **Data files**: Cached files (for large files). Data files are stored outside
-  of the Git repository on a local/shared hard drive or remote storage, but
-  [DVC-files](/doc/user-guide/dvc-file-format) describing that data are stored
-  in Git for DVC needs (to maintain pipelines and reproducibility).
-
-- **Cache directory**: Directory with all data files on a local hard drive or in
-  cloud storage, but not in the Git repository. See `dvc cache dir`.
-
-- **Cloud storage** support: available complement to the core DVC features. This
-  is how a data scientist transfers large data files or shares a GPU-trained
-  model with those without GPUs available.
diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md
index 0f522a8326..782dfc9211 100644
--- a/content/docs/use-cases/index.md
+++ b/content/docs/use-cases/index.md
@@ -1,5 +1,14 @@
 # Use Cases
 
+# Collaboration Issues in Data Science
+
+Even with all the success we've seen today in machine learning (ML),
+specifically deep learning and its applications in business, the data science
+community still lacks good practices for organizing their projects and
+effectively collaborating across their varied ML projects. This is a critical
+challenge: we need to evolve towards ML algorithms and methods no longer being
+tribal knowledge and making them easy to implement, reuse, and manage.
+
 We provide short articles on common ML workflow or data management scenarios
 that DVC can help with or improve. These include the motivating context (usually
 extracted from real-life cases); And the approaches to solving them can combine
diff --git a/content/docs/understanding-dvc/how-it-works.md b/content/docs/user-guide/how-it-works.md
similarity index 100%
rename from content/docs/understanding-dvc/how-it-works.md
rename to content/docs/user-guide/how-it-works.md
diff --git a/content/docs/user-guide/index.md b/content/docs/user-guide/index.md
index e51926f40b..b3f08aa14b 100644
--- a/content/docs/user-guide/index.md
+++ b/content/docs/user-guide/index.md
@@ -1,5 +1,21 @@
 # User Guide
 
+Data Version Control, or DVC, is **a new type of experiment management
+software** that has been built **on top of the existing engineering toolset that
+you're already used to**, and particularly on a source code version control
+system (currently Git). DVC reduces the gap between existing tools and data
+science needs, allowing users to take advantage of experiment management
+software while reusing existing skills and intuition.
+
+The underlying source code control system eliminates the need to use external
+services. Data science experiment sharing and collaboration can be done through
+regular Git tools (commit messages, merges, pull requests, etc) the same way it
+works for software engineers.
+
+DVC implements a **Git experimentation methodology** where each experiment
+exists with its code as well as data, and can be represented as a separate Git
+branch or commit.
+
 Our guides describe the main DVC concepts and features comprehensively,
 explaining when and how to use them, as well as connections between them. These
 guides don't focus on specific scenarios, but have a general scope – like a user
diff --git a/content/docs/understanding-dvc/resources.md b/content/docs/user-guide/resources.md
similarity index 100%
rename from content/docs/understanding-dvc/resources.md
rename to content/docs/user-guide/resources.md

From 3cea8aa7caa79d454b4faf43b0d184d8493fdc8d Mon Sep 17 00:00:00 2001
From: V Abhijith Rao
Date: Wed, 20 May 2020 00:03:05 +0530
Subject: [PATCH 2/2] Update index.md

---
 content/docs/use-cases/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md
index 782dfc9211..e581f665a8 100644
--- a/content/docs/use-cases/index.md
+++ b/content/docs/use-cases/index.md
@@ -1,6 +1,6 @@
 # Use Cases
 
-# Collaboration Issues in Data Science
+## Collaboration Issues in Data Science
 
 Even with all the success we've seen today in machine learning (ML),
 specifically deep learning and its applications in business, the data science