tutorial: NFS/volumes #561

Closed · 2 tasks done
DavidGOrtega opened this issue May 25, 2021 · 27 comments

Labels: documentation (Markdown files), research (Waiting for team investigation)
@DavidGOrtega (Contributor) commented May 25, 2021

Client-side sibling of iterative/terraform-provider-iterative#89

  • Security
  • External storage

Until we have iterative/terraform-provider-iterative#107, iterative/terraform-provider-iterative#123 and iterative/terraform-provider-iterative#89, we could offer some of the recipes we have already crafted as proposed solutions for users in the Discord channel.
Should we document those simple scenarios as a FAQ?

@DavidGOrtega (Contributor, Author) commented May 25, 2021

I think the best approach would be to put everything in collapsible blocks under a FAQ section in the README.

DavidGOrtega reopened this May 26, 2021
@0x2b3bfa0 (Member) commented:

@DavidGOrtega (Contributor, Author) commented:

https://dvc.org/doc/user-guide/managing-external-data#setting-up-an-external-cache?

@0x2b3bfa0

DVC requires that the project's cache is configured in the same external location as the data that will be tracked (external outputs).

We need the storage first? 🤔 Probably @casperdcl can help here

@casperdcl (Contributor) commented:

We could still have CI cache the .dvc/cache directory, if that's what you mean?

@DavidGOrtega (Contributor, Author) commented Jun 1, 2021

@casperdcl I'm not sure. What we originally did (or rather hacked together) in Discord was attaching a volume or NFS storage.
I was guessing that @0x2b3bfa0 was actually referring to that.

@0x2b3bfa0 (Member) commented Jun 1, 2021

It should be technically feasible with something like this:

sudo apt install nfs-common
sudo mount -t nfs EFS_IP_ADDRESS:/ MOUNTPOINT

(From Discord)

Using NFS storage for a cache might not be an optimal solution due to latency and file transfer times. AWS EFS is fast, but not that fast.

@0x2b3bfa0 (Member) commented:

DVC already supports cache over a variety of network transports and, if we plan to offer alternative solutions, they should be as local and as fast as possible. Probably block-based, mounted to the runner machine itself, and without any control over their lifecycle: exactly iterative/terraform-provider-iterative#89.

In the meantime, we can mention https://dvc.org/doc/user-guide/managing-external-data#setting-up-an-external-cache and, perhaps, tell users how to use an external NFS shared cache as per the hack above.
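
Such a recipe could look roughly like this (a sketch only, assuming an AWS EFS share; EFS_IP_ADDRESS and /mnt/dvc-cache are placeholders, and nfsvers=4.1 is the version AWS recommends for EFS):

# Install the NFS client and mount the share.
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/dvc-cache
sudo mount -t nfs -o nfsvers=4.1 EFS_IP_ADDRESS:/ /mnt/dvc-cache
# Reuse the mounted share as the project's DVC cache.
dvc cache dir /mnt/dvc-cache
dvc config cache.shared group

With that in place, dvc checkout can link data straight from the share instead of pulling everything from the main remote on every fresh runner.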

@0x2b3bfa0 (Member) commented Jun 4, 2021

Security requires a bit more attention than a list of officially recommended workarounds: see iterative/terraform-provider-iterative#125

@casperdcl (Contributor) commented Jun 4, 2021

DVC already supports cache over a variety of network transports

are you talking about https://dvc.org/doc/user-guide/managing-external-data#examples? If that's the case, just to clarify... remote caches are only useful for external remote outputs (dvc exp run --external), which isn't a use case we're discussing atm, afaik.

So, for our purposes, DVC does not support cache over the network.

@shcheklein (Member) commented:

So, for our purposes, DVC does not support cache over the network.

We would need to clarify "over the network" here. I think we are on the same page, but to be precise: the DVC cache supports anything that can be mounted as a volume and symlinked/copied from it into the workspace. That means it can be a NAS (and we have had teams with a 70 TB cache organized this way). But we can't do something like dvc cache dir s3://dvc/storage at the moment; this is what remotes are for now (though, as far as I understand, something like dvc cache dir s3://dvc/storage should be easy to implement).
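
For illustration, a minimal sketch of that kind of setup, with a hypothetical NAS mountpoint:

# Any locally mounted filesystem can back the DVC cache.
dvc cache dir /mnt/nas/dvc-cache
# Link files from the cache into the workspace instead of copying them.
dvc config cache.type symlink
# Not supported at the moment (see above):
# dvc cache dir s3://dvc/storage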

@0x2b3bfa0 (Member) commented Jun 6, 2021

TL;DR

There is so much confusion on this point that even we often don't get it iterative/dvc.org#520 (comment)

Touché, @casperdcl! 😄 It looks like my cursory investigation wasn't enough to offer an educated opinion on this topic. 🙈

What I recommended on Discord — see #561 (comment) for more information — was just a way of moving the local cache to an NFS share, like the ones provided by AWS, Azure or GCP. It's slow for what users would expect of a cache [citation needed] but, at least, it serves as intermediate storage to avoid querying the main DVC remote every time CML launches a new instance.

After reviewing the documentation, I noticed that --external would use data in situ from a remote storage other than the DVC remote, without pushing data to the latter under any circumstances. I guess this is not what we want: the main DVC remote should always be the single source of truth for data, and caches, regardless of the implementation details, should only be a convenience storage to accelerate and optimize data transfer operations.

Our reusable cache should be faster and probably cheaper than the remote, both on sequential and random access; otherwise, it would be better to query the remote directly. It may sound like a lapalissade (a statement of the obvious), but it's an important point to consider when choosing the storage type.

@shcheklein (Member) commented:

Does the DVC cache support read/write access from several instances at the same time? If not, the shared cache concept on iterative/terraform-provider-iterative#89 would not be feasible.

yes, it supports multiple clients

Cache should not hold any data that isn't present on the DVC remote: it should be just a faster and potentially cheaper place to store a reasonably updated copy of the data.

it's a bit more nuanced. There are teams that don't use remotes at all :) But otherwise you are right.

@0x2b3bfa0 (Member) commented Jun 7, 2021

Thank you very much for shedding a bit of light on this, @shcheklein! 🙏🏼

yes, it supports multiple clients

Awesome! 🎉

it's a bit more nuanced. There are teams that don't use remotes at all

Makes sense after thinking about the details, though calling it a cache in the external data use case might not be very intuitive, even if the working principle is the same. 🙃 Thanks for the clarification!

@0x2b3bfa0 (Member) commented Jun 7, 2021

as far as I understand something like dvc cache dir s3://dvc/storage should be easy to implement

This is exactly what I was looking for! We don't need it for NFS, which can be treated like any other mounted filesystem, but it would be a great addition for storage systems that can't be meaningfully mounted at the system level, like S3.

Before proposing the implementation of such a feature, I would also like to point out that we could resort to somewhat mountable filesystems like HDFS or Lustre, which seem like a good fit for this kind of use case:

Many workloads such as machine learning, high performance computing (HPC), video rendering, and financial simulations depend on compute instances accessing the same set of data through high-performance shared storage (AWS FSx)

Unfortunately, this kind of solution isn't supported on every cloud without a healthy dose of contrived manual deployment. As users will need to configure it themselves, we should probably consider availability and ease of use among the main comparison points.
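
For instance, mounting an AWS FSx for Lustre filesystem would look roughly like this (a sketch; FS_DNS_NAME and MOUNT_NAME are placeholders, and the client package name varies by distribution and kernel):

# Install the Lustre client.
sudo apt-get install -y lustre-client-modules-$(uname -r)
sudo mkdir -p /fsx
# FS_DNS_NAME and MOUNT_NAME come from the FSx console/API.
sudo mount -t lustre -o noatime,flock FS_DNS_NAME@tcp:/MOUNT_NAME /fsx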

DavidGOrtega reopened this Jun 8, 2021
dmpetrov changed the title from "cml-runner complex scenarios examples" to "Networking: cml-runner complex scenarios examples" Jun 15, 2021
dmpetrov changed the title from "Networking: cml-runner complex scenarios examples" to "Research: Attaching volumes to instances" Jun 15, 2021
0x2b3bfa0 added the documentation and research labels and removed the enhancement label Jun 16, 2021
0x2b3bfa0 changed the title from "Research: Attaching volumes to instances" to "Attaching volumes to instances" Jun 16, 2021
casperdcl changed the title from "Attaching volumes to instances" to "tutorial: NFS/volumes" Jun 29, 2021
@casperdcl (Contributor) commented Jun 29, 2021

OK, so, action point:

@0x2b3bfa0 it would be great if you could put together a performant NFS/volume example repo targeting the use case of extremely large dependencies (>1 TB, where users won't want to dvc get/curl/aws cp, etc.). This is only a proof of concept rather than targeting every single possible user config/setup.

The checkpoint cache stuff is a different issue (#390).

@0x2b3bfa0 (Member) commented Jun 29, 2021

performant NFS

May I add it as a relevant example for the Wikipedia page of oxymoron? 😈 I'll follow up later with a comparison of all the solutions we've talked about and some examples.

@0x2b3bfa0 (Member) commented Jun 30, 2021

Storage requirements

  1. As fast as the average DVC remote, at least.
  2. Able to persist data between consecutive runs.
  3. Accessible by several machines at the same time.
  4. Available on all the public clouds we plan to support.

General storage types

| | Ephemeral | Block | Object | File |
| --- | --- | --- | --- | --- |
| Can offer transfer speeds comparable to hard disks? | | | 🟠 | 🔴 |
| Can be reused from different machines, one at a time? | 🚫 | | | |
| Can be accessed by many machines at the same time? | 🚫 | 🚫 | | |
| Can be mounted at the CI/CD level? | 🚫 | 🚫 | 🟠 | 🟠 |

We[citation needed] have been using the word volumes since the beginning of this issue — not to mention iterative/terraform-provider-iterative#89 — but the concept of volumes is tightly related to block-based storage. Unfortunately, block-based storage can't be accessed by several machines at the same time, so our only possible choices are object-based storage and some kinds of distributed file-based storage.

Specific storage types

Object-based

| | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Name | S3 | Blob Storage | Cloud Storage |
| Alleged speed | 12 GB/s | 12.5 GB/s | N/A |
| Mountable with | s3fs-fuse | azure-storage-fuse | gcsfuse |
| Transit encryption | HTTPS | TLS 1.2 | TLS 1.3 |
| Authentication | Token | Token | Service account |

File-based

| | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Name | EFS | Files | Filestore |
| Alleged speed* | 50 MB/s | 60 MB/s | 100 MB/s |
| Mountable with | ✅ NFSv4 | ✅ NFSv4 | ✅ NFSv3 |
| Transit encryption† | TLS 1.2 | 🚫 NO | 🚫 N/A |
| Authentication | 🚫 NO | 🚫 N/A | 🚫 NO |

Others

While other distributed filesystems like HDFS or Lustre might be a good option in some scenarios, they haven't been widely adopted by the popular public clouds. AWS FSx looks really good, but isn't portable.

* Alleged base speed; it improves if you store more than 10 terabytes with some providers, or if you pay for additional burst speed credits.
† Not that important if the NFS service can only be accessed through the local network, but that would require ClickOps.
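
For illustration, the FUSE mounts from the object-based table would look roughly like this (a sketch; BUCKET, the mountpoints and the credential files are placeholders):

# AWS S3 via s3fs-fuse (credentials in ~/.passwd-s3fs).
s3fs BUCKET /mnt/s3 -o passwd_file=$HOME/.passwd-s3fs
# Azure Blob Storage via azure-storage-fuse (blobfuse v1-style invocation).
blobfuse /mnt/blob --tmp-path=/tmp/blobfuse --config-file=connection.cfg
# Google Cloud Storage via gcsfuse (uses application default credentials).
gcsfuse BUCKET /mnt/gcs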

@0x2b3bfa0 (Member) commented Jun 30, 2021

🔔 @iterative/cml, I'll write some FUSE examples as soon as I make sure that nobody has a personal preference for NFS. ⚔️

@0x2b3bfa0 (Member) commented Jul 2, 2021

Note: this requires additional discussion and will probably be merged with iterative/terraform-provider-iterative#89

@casperdcl (Contributor) commented:

related: iterative/dvc.org#2587

@0x2b3bfa0 (Member) commented Aug 21, 2021

Mounting FUSE devices inside a container is still not possible without the SYS_ADMIN capability and some extra privileges:

Linux already supports unprivileged userspace mounts as per torvalds/linux@4ad769f3c346 and we're just missing support from container runtimes.
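
Until then, granting a container the privileges FUSE needs looks like this (a sketch; IMAGE is a placeholder):

# Expose the FUSE device and grant the SYS_ADMIN capability;
# some distributions also require disabling the AppArmor profile.
docker run --rm -it \
  --device /dev/fuse \
  --cap-add SYS_ADMIN \
  --security-opt apparmor=unconfined \
  IMAGE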

@0x2b3bfa0 (Member) commented:

In the meantime, we can mount the filesystem at the instance level (on machines) or with additional privileges (on containers), but this could have a negative impact on container isolation.

@0x2b3bfa0 (Member) commented:

The question is: does attaching object-based or file-based storage make any sense if we take into account the limitations described in iterative/dvc.org#2587?

Attaching this kind of storage would be approximately as practical as pulling/pushing data with dvc to/from any of the supported remotes; the only difference is that data manipulation would happen in situ, without requiring a local scratchpad.

@0x2b3bfa0 (Member) commented Oct 25, 2021

Real-life performance is between 30 and 50 MiB/s with both rclone and FUSE on all the supported cloud providers. Might be network-bound, though.

@0x2b3bfa0 (Member) commented:

Getting peaks of ~500 MiB/s (yes, the ISO 80000-13 ones) on S3 with a beefy c5a.24xlarge instance after fine-tuning rclone settings.
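
For reference, the kind of tuning meant here looks roughly like this (a sketch; remote:bucket is a placeholder and the values are illustrative, not the exact settings used):

# Raise parallelism and chunk sizes for large sequential transfers.
rclone mount remote:bucket /mnt/data \
  --vfs-cache-mode writes \
  --vfs-read-chunk-size 128M \
  --transfers 32 \
  --buffer-size 256M

Higher --transfers and larger chunks mainly help big sequential reads; small random access remains latency-bound.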

Closed with iterative/terraform-provider-iterative#237
