task
bucket usage vs "directory" within a bucket
#299
Limits

As per the research below, users can have at least 100 concurrent tasks running in any of the supported providers. Upon request, limits can be increased up to 250 per region in the worst-case scenario. Highly parallelizable tasks, like hyperparameter optimization, can use the […] Moreover, the number of concurrent tasks is usually bound to the orchestration limits, not to the storage limits.

Storage
|
Deletion

Tasks have been designed to be deleted as soon as they finish or, rather, as soon as the user realizes that they have finished; e.g. the next morning. Even in moderately big teams, it would be unusual to have more than 100 concurrent tasks when following this approach.¹ Teams with more than ~20 data scientists will necessarily have to be backed by specialized DevOps engineers, who already have the ability to increase those limits, and even to overcome them with workarounds.²
|
Persistence

Task storage is not meant to replace persistent data/model storage and versioning tools like DVC. It's only meant to share state (e.g. checkpoints) between several short-lived machines. Even DVC, which is a data-oriented tool, doesn't¹ include a mechanism to create (much less delete) persistent storage resources. This is something that most organizations prefer to manage separately, and often in disparate ways. Because of the diversity and complexity involved in persistent storage provisioning, it would be risky to include such a feature as part of this project.

Logs

Likewise, task storage is not meant to replace log management platforms.² For the same reasons as data, log ingestion, storage and monitoring should be performed by means of specialized tools.

Responsibility

Infrastructure management tools have the responsibility of deleting all the resources they create, even when those resources are meant to exist for long periods of time. Creating long-lived resources implies providing a way of deleting them.³
|
Here we go again! |
Who says? I can show you cases where 100 is super small. As we have been discussing, a team with 3 models and a matrix can launch more than those 100 in a breeze, and we do not have a perfect destroy in the CI yet |
As stated on #299 (comment), matrix use cases would benefit from |
Unfortunately, quota requests take time. In the meantime, #314 should be addressed in order to prevent the bug or, at least, to mitigate it. Let's demote this discussion to important after extracting the critical issue. |
I am in favor of being able to specify as many (pre-existing) resources as possible, for greater control of the overall cloud provider account: buckets to use, security groups/firewall rules, instance role/service account, image/AMI, etc. Many of these are already set up/created for organizations, and this tool is for helping to manage short-lived instances. It should manage as few "infrastructure" pieces as possible while still being easy to use, with a low barrier of entry for more agile users? |
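A rough sketch, in HCL, of what specifying pre-existing resources could look like; the storage/network/identity attributes below (bucket, security_group, instance_role) are hypothetical and not part of the actual TPI schema, and the resource IDs are placeholders:

```hcl
# Sketch only: reuse resources that already exist in the account instead
# of having the provider create them. The bucket/security_group/
# instance_role attributes are hypothetical illustrations of the idea.
resource "iterative_task" "example" {
  cloud   = "aws"
  machine = "m"
  image   = "ami-0123456789abcdef0"       # pre-existing machine image (illustrative ID)

  bucket         = "my-org-ml-artifacts"  # hypothetical: pre-existing bucket
  security_group = "sg-0123456789abcdef0" # hypothetical: existing firewall rules
  instance_role  = "tpi-task-role"        # hypothetical: existing role/service account

  script = <<-END
    #!/bin/bash
    python train.py
  END
}
```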
Can you please write down an example of what it would look like? I do not think that 0x2b3bfa0/cml-use-case-matrix-task resembles a useful or realistic case, and I do not fully understand the usage of task there. We are using the task to launch the runners, however this approach is not yet right (data sync is totally useless). To make such a case interesting and useful, the runners should recover the previous data folder once they start after the spot termination, and that does not happen since the workdir changes on every runner startup. |
Architecture

Tasks have been designed to be completely ephemeral, and able to run in a pristine cloud account without additional configuration. Treating storage¹ as an ephemeral resource may be an unusual choice, but there is no other way of avoiding a separate installation process. If we wanted to use a shared bucket for all the tasks, it would make sense² to embrace the official providers instead of writing our own, and just publish a module meant to deploy persistent resources (like an object storage bucket or an instance orchestrator) to be used by every task. GitHub recommends something similar, but it's still overcomplicated. Designing a "new" task orchestrator out of cloud primitives (virtual machine orchestrator, queue, log aggregator, object storage, et cetera) would imply reinventing the heptagonal wheel as previously stated, and would ultimately lead us to consider nodeless Kubernetes solutions based on Elotl Kip (source code) and Cluster API spot instances.
|
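For reference, the "module for persistent resources" alternative could be approximated with the official AWS provider like this (resource and bucket names are illustrative); each task would then point at the shared bucket instead of creating its own:

```hcl
# Deployed once per organization with the official AWS provider, outside
# of any individual task; tasks would reference the bucket by name.
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "tpi_shared" {
  bucket = "my-org-tpi-shared-storage" # long-lived, managed separately from tasks
}

output "task_bucket" {
  value = aws_s3_bucket.tpi_shared.bucket
}
```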
I'm inclined to think that it's fine to use ephemeral buckets to cache data and keep artifacts until users “harvest” them. Still, treating object storage buckets as an ephemeral resource looks like a pretty unusual practice, as @dmpetrov pointed out. Pinging @duijf, @JIoJIaJIu and @shcheklein for a
There might be other alternatives I overlooked, though. |
Please note,
This might be a path to an existing bucket, with a path/key like the one the user specified |
@dacbd would you prefer to have an ephemeral / temporary bucket like […]? A separate question - would you prefer to keep the output/state/logs of the task after the task is done or has failed, or to remove the bucket or directory? |
My 2 cents:
More background below :)

Playing nice with existing infra
Very much agreed with this. Hopefully being able to specify existing resources isn't mutually exclusive with some sort of onboarding experience where TPI can abstract some stuff for users that are new to this.

Buckets + quotas

The quota problems look pretty serious to me. Even if you can boost your quotas, I wouldn't bet on people being happy to give up significant portions of their quota just for TPI. From the outside, it looks like a pretty arbitrary decision to have a bucket per task, which also adds a lot of extra moving parts + papercuts. I would seriously consider going for directories in a bucket that already exists.

Cleanup
"Cleanup everything" seems like a good default, but this should probably be configurable. "Always cleanup", "Cleanup on success", "Never cleanup" all make sense to me. (Not sure if there is a use case for "Only cleanup on failure".) |
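A configurable cleanup policy like this could be a single attribute on the task; the name and accepted values below are hypothetical, purely to make the idea concrete:

```hcl
resource "iterative_task" "example" {
  cloud   = "aws"
  machine = "m"
  script  = "python train.py"

  # Hypothetical attribute, not in the current TPI schema.
  # One of: "always" (current behavior), "on_success", "never".
  cleanup = "on_success"
}
```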
I have the same feeling. My understanding: you create and delete a bucket when you provision a new resource, like a database or a new system deployment. An experiment / train task is not a new resource, it is just a run. The Terraform backend might be a good analogy - it uses an existing path, and it does not destroy the bucket:
|
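For reference, a standard Terraform S3 backend configuration of the kind the analogy refers to looks like this (bucket name and key are illustrative):

```hcl
terraform {
  backend "s3" {
    # The bucket must already exist; Terraform only writes its state under
    # the given key ("directory") and never destroys the bucket itself.
    bucket = "my-org-terraform-state"
    key    = "projects/ml/terraform.tfstate"
    region = "us-east-1"
  }
}
```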
@dmpetrov I think that an ephemeral bucket is a fine default, but given a path it should use a directory at the base of the path: […] and I'll reiterate @duijf in:
|
Thank you very much for the thorough feedback! ❤️ A good compromise would be adding a
Examples

Cloud providers
Kubernetes
Limitations
Still, leaving data in the cloud after destroying the task comes with some challenges. Not sure if we should support this persistence use case out of the box. |
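To make the compromise concrete, pointing a task at an existing bucket plus a prefix might look roughly like this; the storage_path attribute is an assumption about the proposed API, not part of the current schema:

```hcl
resource "iterative_task" "example" {
  cloud   = "aws"
  machine = "m"
  script  = "python train.py"

  # Hypothetical: "bucket/prefix". The pre-existing bucket is never
  # destroyed; only the contents of the prefix are managed by the task.
  storage_path = "my-org-ml-artifacts/experiments/run-42"
}
```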
Probably related to the API proposal on #307 (comment): we now have enough storage-related attributes to consider grouping them all into a single storage block. |
I think that when using a predefined bucket it would be easy enough for the user to ensure persistence before running |
🤔 To my mind it's quite the opposite. I understand where the motivation comes from - TF destroy should clean up allocated resources. But I'm not sure how this applies to our use case. I consider the logs, data and config files that TPI copies to the cloud as logs. And removing logs looks like a somewhat strange practice. TF itself also does not follow its own rules of destroying resources. See the example with the TF backend in #299 (comment). It feels like we are introducing artificial rules here 🙂 I'd suggest providing maximum flexibility for users and not destroying logs until users directly ask for it (in config). |
related: |
We are generating a bucket with every task; however, this approach has several drawbacks and a bug:
bug
A better approach would be to let the user specify a bucket or, if none is specified, to create the default .tpi bucket.
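A minimal sketch of the behavior proposed above, under the assumption of a hypothetical bucket attribute (today the provider always generates a new bucket per task):

```hcl
# Variant 1 (hypothetical): the user specifies a pre-existing bucket.
resource "iterative_task" "explicit_bucket" {
  cloud   = "aws"
  machine = "m"
  script  = "python train.py"
  bucket  = "my-org-ml-artifacts" # hypothetical attribute
}

# Variant 2 (hypothetical): no bucket given; the provider would fall back
# to a shared default ".tpi" bucket instead of creating one per task.
resource "iterative_task" "default_bucket" {
  cloud   = "aws"
  machine = "m"
  script  = "python train.py"
}
```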