diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md
index 971ad2e99f..bc316c479b 100644
--- a/content/docs/use-cases/index.md
+++ b/content/docs/use-cases/index.md
@@ -1,18 +1,17 @@
 # Use Cases
 
-We provide short articles on common ML workflow or data management scenarios
-that DVC can help with or improve. Our use cases are not written to be run
-end-to-end like tutorials. For more general, hands-on experience with DVC,
-please see our [Get Started](/doc/tutorials/get-started) instead.
+We provide short articles on common ML workflow and data science use cases that
+DVC can help with or improve. Our use cases are not written to be run end-to-end
+like tutorials. For more general, hands-on experience with DVC, please see
+[Get Started](/doc/tutorials/get-started) instead.
 
 ## Why DVC?
 
 Even with all the success we've seen today in machine learning (ML), especially
-with deep learning and its applications in business, the data science community
-still lacks good practices for organizing their projects and collaborating
-effectively. This is a critical challenge: while ML algorithms and methods are
-no longer tribal knowledge, they are still difficult to implement, reuse, and
-manage.
+with deep learning and its applications in business, data scientists still lack
+best practices for organizing their projects and collaborating effectively. This
+is a critical challenge: while ML algorithms and methods are no longer tribal
+knowledge, they are still difficult to implement, reuse, and manage.
 
 ## Basic uses of DVC
 
@@ -20,10 +19,11 @@ If you store and process data files or datasets to produce other data or
 machine learning models, and you want to
 
 - capture and save data artifacts the same way you capture code;
-- track and switch between different versions of data or models easily;
-- understand how data or models were built in the first place;
-- be able to compare models and metrics to each other;
-- bring software engineering best practices to your data science team
+- track, control, and switch between different versions of data or models
+  easily;
+- understand how data or ML models were built in the first place;
+- compare machine learning models and metrics to each other;
+- bring software engineering best practices and tools to your data science team
 
 DVC is for you!
 
diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md
index be28edd905..448bdeab55 100644
--- a/content/docs/use-cases/versioning-data-and-model-files/index.md
+++ b/content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -11,8 +11,8 @@ pull requests, etc.)
 
 To actually store the data, DVC uses a built-in cache, and supports
 synchronizing it with various types of
-[remote storage](/doc/command-reference/remote). This allows storing and sharing
-data easily, and alongside code.
+[remote storage](/doc/command-reference/remote). This allows for easy data and
+model versioning, storage, and sharing — right alongside code.
 
 ![](/img/model-versioning-diagram.png) _Code and data flows in DVC_
 
@@ -30,9 +30,9 @@ on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
 ## DVC is not Git!
 
 DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
-data files and directories (among other purposes). They point to specific data
-contents in the cache, providing the ability to store multiple data
-versions out-of-the-box.
+data files and directories for versioning (among other purposes). They point to
+specific data contents in the cache, providing the ability to store
+multiple data versions out-of-the-box.
 
 Full-fledged
 [version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
@@ -46,7 +46,7 @@ several other novel features (see [Get Started](/doc/start/) for a primer.)
 
 Let's say you have an empty DVC repository and put a dataset of images in the
 `images/` directory. You can start tracking it with `dvc add`.
-This generate a `.dvc` file, which can be committed to Git in order to save the
+This generates a `.dvc` file, which can be committed to Git in order to save the
 project's version:
 
 ```dvc
@@ -116,7 +116,8 @@ M model.pkl
 ```
 
 However, we can checkout certain parts only, for example if we want to keep the
-latest source code and model but rewind to the previous dataset only:
+latest source code and model versions, but rewind to the previous version of the
+dataset:
 
 ```dvc
 $ git checkout v1.0 images.dvc
@@ -125,5 +126,5 @@ M images
 ```
 
 DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
-avoiding copying files each time, so checking out data is quick even if you have
-large data files.
+avoiding copying files each time, so checking out data is quick even if you are
+versioning large data files.
diff --git a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
index ad8c5a628e..20d6766ca3 100644
--- a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
+++ b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
@@ -1,8 +1,8 @@
-# Tutorial: Versioning
+# Tutorial: Data & Model Versioning
 
 The goal of this example is to give you some hands-on experience with a basic
-machine learning version control scenario: working with multiple versions of
-datasets and ML models using DVC commands. We'll work with a
We'll work with a +machine learning version control scenario: managing multiple datasets and ML +model versions using DVC commands. We'll work with a [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html) that [François Chollet](https://twitter.com/fchollet) put together to show how to build a powerful image classifier using a pretty small dataset. @@ -237,9 +237,9 @@ $ git commit -m "Second model, trained with 2000 images" $ git tag -a "v2.0" -m "model v2.0, 2000 images" ``` -That's it! We have tracked a second dataset, model, and metrics versioned DVC, -and the DVC-files that point to them committed with Git. Let's now look at how -DVC can help us go back to the previous version if we need to. +That's it! We've tracked a second version of the dataset, model, and metrics in +DVC and committed the DVC-files that point to them with Git. Let's now look at +how DVC can help us go back to the previous version if we need to. ## Switching between workspace versions @@ -338,15 +338,15 @@ changed. For example, when we added new images to built the second version of our model, that was a dependency change. It also updates outputs and puts them into the cache. -To make things a little simpler: if `dvc add` and `dvc checkout` provide a basic -mechanism to version control large data files or models, `dvc run` and -`dvc repro` provide a build system for ML models, which is similar to +To make things a little simpler: `dvc add` and `dvc checkout` provide a basic +mechanism for model and large dataset versioning. `dvc run` and `dvc repro` +provide a build system for machine learning models, which is similar to [Make](https://www.gnu.org/software/make/) in software build automation. ## What's next? -In this example, our focus was on giving you hands-on experience with versioning -ML models and datasets. 
We specifically looked at the `dvc add` and +In this example, our focus was on giving you hands-on experience with dataset +and ML model versioning. We specifically looked at the `dvc add` and `dvc checkout` commands. We'd also like to outline some topics and ideas you might be interested to try next to learn more about DVC and how it makes managing ML projects simpler. diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md index 18f86f0acb..ab7e2c2753 100644 --- a/content/docs/user-guide/what-is-dvc.md +++ b/content/docs/user-guide/what-is-dvc.md @@ -1,6 +1,6 @@ # What Is DVC? -**Data Version Control** is a new type of data versioning, workflow and +**Data Version Control** is a new type of data versioning, workflow, and experiment management software, that builds upon [Git](https://git-scm.com/) (although it can work stand-alone). DVC reduces the gap between established engineering tool sets and data science needs, allowing users to take advantage @@ -10,7 +10,8 @@ of new [features](#core-features) while reusing existing skills and intuition. Data science experiment sharing and collaboration can be done through a regular Git flow (commits, branching, pull requests, etc.), the same way it works for -software engineers. +software engineers. Using Git and DVC, data science and machine learning teams +can version experiments, manage large datasets, and make projects reproducible. ## Core Features @@ -22,7 +23,7 @@ software engineers. [versioning](/doc/use-cases/versioning-data-and-model-files) capabilities. - **Data versioning** is enabled by replacing large files, dataset directories, - ML models, etc. with small + machine learning models, etc. with small [metafiles](/doc/user-guide/dvc-files-and-directories) (easy to handle with Git). These placeholders point to the original data, which is decoupled from source code management.