tutorial: NFS/volumes #561
I think the best idea would be to put everything in collapsible blocks under a FAQ in the README.
We need the storage first? 🤔 Probably @casperdcl can help here
could still have CI cache
@casperdcl I'm not sure. What we originally did, or rather hacked together in Discord, was to attach a volume or NFS storage.
Using NFS storage for a cache might not be an optimal solution due to latency and file transfer times. AWS EFS is fast, but not that fast.
DVC already supports cache over a variety of network transports and, if we plan to offer alternative solutions, they should be as local and as fast as possible: probably block-based, mounted to the runner machine itself, and without any control over their lifecycle; exactly iterative/terraform-provider-iterative#89. In the meantime, we can mention https://dvc.org/doc/user-guide/managing-external-data#setting-up-an-external-cache and, perhaps, tell users how to use an external NFS shared cache as per the hack above.
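For reference, here is a minimal sketch of the NFS shared cache hack mentioned above, assuming an AWS EFS share and an Ubuntu runner; the `dvc cache dir` command is real, but the DNS name and mount point are placeholders:

```bash
# Mount the NFS share on the runner (hypothetical EFS DNS name).
sudo apt-get install --yes nfs-common
sudo mkdir --parents /mnt/dvc-cache
sudo mount -t nfs4 fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/dvc-cache

# Point the DVC cache at the share instead of the default .dvc/cache.
dvc cache dir /mnt/dvc-cache
```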
Security requires a bit more attention than a list of officially recommended workarounds: see iterative/terraform-provider-iterative#125
Are you talking about https://dvc.org/doc/user-guide/managing-external-data#examples? If that's the case, just to clarify: remote caches are only useful for external remote outputs. So, for our purposes, DVC does not support cache over the network.
We would need to clarify "over network" here. I think we are on the same page, but to be precise: the DVC cache supports anything that can be mounted as a volume and symlinked/copied from it into the workspace. That means it can be a NAS (and we had teams with a 70 TB cache organized this way). But we can't do something like …
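For completeness, the team-shared NAS setup described above usually boils down to a few DVC settings; a hedged sketch (the configuration options are real DVC ones, the mount point is a placeholder):

```bash
# Use a cache directory on the mounted NAS/NFS volume.
dvc cache dir /mnt/shared-dvc-cache

# Loosen permissions so every team member can write to the same cache.
dvc config cache.shared group

# Symlink files from the cache into the workspace instead of copying them.
dvc config cache.type symlink

# Re-link existing workspace files against the shared cache.
dvc checkout --relink
```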
TL;DR
Touché, @casperdcl! 😄 It looks like my cursory investigation wasn't enough to offer an educated opinion on this topic. 🙈 What I recommended on Discord (see #561 (comment) for more information) was just a way of moving the local cache to an NFS share, like the ones provided by AWS, Azure or GCP. It's slow for what users would expect of a cache [citation needed] but, at least, it serves as intermediate storage to avoid querying the main DVC remote every time CML launches a new instance.

After reviewing the documentation, I noticed that our reusable cache should be faster and probably cheaper than the remote, both on sequential and random access; otherwise, it would be better to query the remote directly. It may sound like a lapalissade (a truism), but it's an important point to consider when choosing the storage type.
yes, it supports multiple clients
it's a bit more nuanced. There are teams that don't use remotes at all :) But otherwise you are right.
Thank you very much for shedding a bit of light on this, @shcheklein! 🙏🏼
Awesome! 🎉
Makes sense after thinking through the details, though calling it a cache in the external data use case might not be too intuitive, even if the working principle is the same. 🙃 Thanks for the clarification!
This is exactly what I was looking for! We don't need it for NFS, as it can be regarded as any other mounted filesystem, but it would be a great addition for storage systems that can't be mounted at the system level in any meaningful way, like S3.

Before proposing the implementation of such a feature, I would also like to point out that we could resort to somewhat mountable filesystems like HDFS or Lustre, which seem a good fit for this kind of use case. The pity is that this kind of solution isn't supported on every cloud without a healthy dose of contrived manual deployment. As users will need to configure it themselves, we probably need to consider availability and ease of use among the main comparison points.
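As an illustration of the "contrived manual deployment" point, a hedged sketch of mounting an AWS FSx for Lustre filesystem on an Ubuntu runner; the syntax follows the stock Lustre client, but the DNS name and mount name are placeholders:

```bash
# Install the Lustre client (package name varies by distribution and kernel).
sudo apt-get install --yes lustre-client-modules-$(uname -r)

# Mount the filesystem (hypothetical FSx for Lustre DNS and mount names).
sudo mkdir --parents /mnt/fsx
sudo mount -t lustre -o noatime,flock \
  fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/mountname /mnt/fsx
```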
Ok, so action point: @0x2b3bfa0, it would be great if you could put together a performant NFS/volume example repo targeting the use case of extremely large dependencies (>1 TB, where users won't want to pull everything from the remote on every run). The checkpoint cache stuff is a different issue (#390).
May I add it as a relevant example to the Wikipedia page on oxymoron? 😈 I'll follow up later with a comparison of all the solutions we've talked about and some examples.
Storage requirements
General storage types
We [citation needed] have been using the word volumes since the beginning of this issue (not to mention iterative/terraform-provider-iterative#89), but the concept of volumes is tightly related to block-based storage. Unfortunately, block-based storage can't be accessed by several machines at the same time, so our only possible choices are object-based storage and some kinds of distributed file-based storage.

Specific storage types

Object-based
File-based
Others

While other distributed filesystems like HDFS or Lustre might be a good option in some scenarios, they haven't been widely adopted by the popular public clouds. AWS FSx looks really good, but isn't portable.

Recommended reads

* Alleged base speed; it gets better if you store more than 10 terabytes with some providers or pay for additional burst speed credits.
🔔 @iterative/cml, I'll write some FUSE examples as soon as I make sure that nobody has a personal preference for NFS. ⚔️ |
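For reference, a hedged FUSE sketch using s3fs-fuse to expose an S3 bucket as a local directory; the bucket name and mount point are placeholders:

```bash
# Install the s3fs FUSE client (Debian/Ubuntu).
sudo apt-get install --yes s3fs

# Mount the bucket (hypothetical name); credentials come from the instance
# profile here, but an s3fs password file would also work.
mkdir --parents /mnt/my-bucket
s3fs my-bucket /mnt/my-bucket -o iam_role=auto
```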
Note: requires additional discussion and, probably, will be merged with iterative/terraform-provider-iterative#89
related: iterative/dvc.org#2587
Mounting FUSE devices inside a container is still not possible without granting additional privileges (typically the `SYS_ADMIN` capability and access to `/dev/fuse`).
Linux already supports unprivileged userspace mounts as per torvalds/linux@4ad769f3c346 and we're just missing support from container runtimes. |
In the meantime, we can mount the filesystem at the instance level (on machines) or with additional privileges (on containers), but this could have a negative impact on container isolation.
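For illustration, a minimal sketch of the extra privileges a Docker container currently needs for FUSE mounts; the flags are standard Docker options, and the image is just an example:

```bash
# Expose the FUSE device and grant the capability FUSE mounts require;
# this is exactly the isolation trade-off mentioned above.
docker run --rm -it \
  --device /dev/fuse \
  --cap-add SYS_ADMIN \
  ubuntu:20.04 bash
# Some hosts additionally need --security-opt apparmor:unconfined.
```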
The question is: does attaching object-based or file-based storage make any sense if we take into account the limitations exposed in iterative/dvc.org#2587? Attaching this kind of storage would be approximately as practical as pulling/pushing data with DVC.
Closed with iterative/terraform-provider-iterative#237
Client-side sibling of iterative/terraform-provider-iterative#89
Until we have iterative/terraform-provider-iterative#107, iterative/terraform-provider-iterative#123 and iterative/terraform-provider-iterative#89, we could offer some recipes that we have crafted as proposed solutions for some users in the Discord channel. Would we need to create those simple scenarios in the docs as a FAQ?