task
bucket usage vs "directory" within a bucket
#299
Limits

As per the research below, users can have at least 100 concurrent tasks running in any of the supported providers. Upon request, limits can be increased up to 250 per region in the worst-case scenario. Highly parallelizable tasks, like hyperparameter optimization, can use the […] Moreover, the number of concurrent tasks is usually bound to the orchestration limits, not to the storage limits.

Storage
|
Deletion

Tasks have been designed to be deleted as soon as they finish or, rather, as soon as the user realizes that they have finished; e.g. the next morning. Even in moderately big teams, it would be unusual to have more than 100 concurrent tasks when following this approach.¹ Teams with more than ~20 data scientists will necessarily have to be backed by specialized DevOps engineers, who already have the ability to increase those limits, and even to overcome them with workarounds.²
|
Persistence

Task storage is not meant to replace persistent data/model storage and versioning tools like DVC. It's only meant to share state (e.g. checkpoints) between several short-lived machines. Even DVC, which is a data-oriented tool, doesn't¹ include a mechanism to create (much less delete) persistent storage resources. This is something that most organizations prefer to manage separately, and often in disparate ways. Because of the diversity and complexity involved in persistent storage provisioning, it would be risky to include such a feature as part of this project.

Logs

Likewise, task storage is not meant to replace log management platforms.² For the same reasons as data, log ingestion, storage and monitoring should be performed by means of specialized tools.

Responsibility

Infrastructure management tools have the responsibility of deleting all the resources they create, even when those resources are meant to exist for long periods of time. Creating long-lived resources implies providing a way of deleting them.³
|
Here we go again! |
Who says? I can show you cases where 100 is super small. As we have been discussing, a team with 3 models and a matrix can launch more than those 100 in a breeze, and we do not have a perfect destroy in the CI yet |
As stated on #299 (comment), matrix use cases would benefit from |
Unfortunately, quota requests take time. In the meantime, #314 should be addressed in order to prevent the bug or, at least, to mitigate it. Let's demote this discussion to important after extracting the critical issue. |
I am in favor of being able to specify as many (pre-existing) resources as possible, for greater control of the overall cloud provider account: buckets to use, security groups/firewall rules, instance role/service account, image/AMI, etc. Many of these are already set up/created for organizations, and this tool is for helping to manage short-lived instances. It should manage as few "infrastructure" pieces as possible while still being easy to use, with a low barrier of entry for more agile users? |
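A rough sketch, in HCL, of what specifying pre-existing resources could look like; the storage/network/identity attributes below (bucket, security_group, instance_role) are hypothetical and not part of the actual TPI schema, and the resource IDs are placeholders:

```hcl
# Sketch only: reuse resources that already exist in the account instead
# of having the provider create them. The bucket/security_group/
# instance_role attributes are hypothetical illustrations of the idea.
resource "iterative_task" "example" {
  cloud   = "aws"
  machine = "m"
  image   = "ami-0123456789abcdef0"       # pre-existing machine image (illustrative ID)

  bucket         = "my-org-ml-artifacts"  # hypothetical: pre-existing bucket
  security_group = "sg-0123456789abcdef0" # hypothetical: existing firewall rules
  instance_role  = "tpi-task-role"        # hypothetical: existing role/service account

  script = <<-END
    #!/bin/bash
    python train.py
  END
}
```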
Can you please write down an example of what it would look like? I do not think that 0x2b3bfa0/cml-use-case-matrix-task resembles a useful or realistic case, and I do not fully understand the usage of task there. We are using the task to launch the runners, however this approach is not yet right (data sync is totally useless). To make such a case interesting and useful, the runners should recover the previous data folder once they start after the spot termination, and that does not happen since the workdir changes on every runner startup. |
Architecture

Tasks have been designed to be completely ephemeral, and able to run in a pristine cloud account without additional configuration. Treating storage¹ as an ephemeral resource may be an unusual choice, but there is no other way of avoiding a separate installation process. If we wanted to use a shared bucket for all the tasks, it would make sense² to embrace the official providers instead of writing our own, and just publish a module meant to deploy persistent resources (like an object storage bucket or an instance orchestrator) to be used by every task. GitHub recommends something similar, but it's still overcomplicated. Designing a "new" task orchestrator out of cloud primitives (virtual machine orchestrator, queue, log aggregator, object storage, et cetera) would imply reinventing the heptagonal wheel as previously stated, and would ultimately lead us to consider nodeless Kubernetes solutions based on Elotl Kip (source code) and Cluster API spot instances.
|
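For reference, the "module for persistent resources" alternative could be approximated with the official AWS provider like this (resource and bucket names are illustrative); each task would then point at the shared bucket instead of creating its own:

```hcl
# Deployed once per organization with the official AWS provider, outside
# of any individual task; tasks would reference the bucket by name.
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "tpi_shared" {
  bucket = "my-org-tpi-shared-storage" # long-lived, managed separately from tasks
}

output "task_bucket" {
  value = aws_s3_bucket.tpi_shared.bucket
}
```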
I'm inclined to think that it's fine to use ephemeral buckets to cache data and keep artifacts until users “harvest” them. Still, treating object storage buckets as an ephemeral resource looks like a pretty unusual practice, as @dmpetrov pointed out. Pinging @duijf, @JIoJIaJIu and @shcheklein for a
There might be other alternatives I overlooked, though. |
Please note,
This might be a path to an existing bucket, with a path/key like the one the user specified |
@dacbd would you prefer to have an ephemeral / temporary bucket like […]? A separate question - would you prefer to keep the output/state/logs of the task after the task is done or has failed, or to remove the bucket or directory? |
My 2 cents:
More background below :)

Playing nice with existing infra
Very much agreed with this. Hopefully being able to specify existing resources isn't mutually exclusive with some sort of onboarding experience where TPI can abstract some stuff for users that are new to this.

Buckets + quotas

The quota problems look pretty serious to me. Even if you can boost your quotas, I wouldn't bet on people being happy to give up significant portions of their quota just for TPI. From the outside, it looks like a pretty arbitrary decision to have a bucket per task, which also adds a lot of extra moving parts + papercuts. I would seriously consider going for directories in a bucket that already exists.

Cleanup
"Cleanup everything" seems like a good default, but this should probably be configurable. "Always cleanup", "Cleanup on success", "Never cleanup" all make sense to me. (Not sure if there is a use case for "Only cleanup on failure".) |
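A configurable cleanup policy like this could be a single attribute on the task; the name and accepted values below are hypothetical, purely to make the idea concrete:

```hcl
resource "iterative_task" "example" {
  cloud   = "aws"
  machine = "m"
  script  = "python train.py"

  # Hypothetical attribute, not in the current TPI schema.
  # One of: "always" (current behavior), "on_success", "never".
  cleanup = "on_success"
}
```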
I have the same feeling. My understanding: you create and delete a bucket when you provision a new resource, like a database or a new system deployment. An experiment / train task is not a new resource, it is just a run. The Terraform backend might be a good analogy - it uses an existing path, and it does not destroy the bucket:
|
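For reference, a standard Terraform S3 backend configuration of the kind the analogy refers to looks like this (bucket name and key are illustrative):

```hcl
terraform {
  backend "s3" {
    # The bucket must already exist; Terraform only writes its state under
    # the given key ("directory") and never destroys the bucket itself.
    bucket = "my-org-terraform-state"
    key    = "projects/ml/terraform.tfstate"
    region = "us-east-1"
  }
}
```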
@dmpetrov I think that an ephemeral bucket is a fine default, but given a path it should use a directory at the base of the path: […] and I'll reiterate @duijf in:
|
Thank you very much for the thorough feedback! ❤️ A good compromise would be adding a
Examples

Cloud providers
Kubernetes
Limitations
Still, leaving data in the cloud after destroying the task comes with some challenges. Not sure if we should support this persistence use case out of the box. |
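To make the compromise concrete, pointing a task at an existing bucket plus a prefix might look roughly like this; the storage_path attribute is an assumption about the proposed API, not part of the current schema:

```hcl
resource "iterative_task" "example" {
  cloud   = "aws"
  machine = "m"
  script  = "python train.py"

  # Hypothetical: "bucket/prefix". The pre-existing bucket is never
  # destroyed; only the contents of the prefix are managed by the task.
  storage_path = "my-org-ml-artifacts/experiments/run-42"
}
```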
Probably related to the API proposal on #307 (comment): we now have enough storage-related attributes to consider grouping them all into a single storage block. |
I think that when using a predefined bucket it would be easy enough for the user to ensure persistence before running |
🤔 To my mind it's quite the opposite. I understand where the motivation comes from - TF destroy should clean up allocated resources. But I'm not sure how this applies to our use case. I consider the logs, data and config files that TPI copies to the cloud as logs. And removing logs looks like a somewhat strange practice. TF itself also does not follow its own rules of destroying resources. See the example with the TF backend in #299 (comment). It feels like we are introducing artificial rules here 🙂 I'd suggest providing maximum flexibility for users and not destroying logs until users directly ask for it (in config). |
related: |
We are generating a bucket with every task; however, this approach has several drawbacks and a bug:
bug
A better approach would be to let the user specify a bucket or, if none is specified, to create the default .tpi bucket.
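A minimal sketch of the behavior proposed above, under the assumption of a hypothetical bucket attribute (today the provider always generates a new bucket per task):

```hcl
# Variant 1 (hypothetical): the user specifies a pre-existing bucket.
resource "iterative_task" "explicit_bucket" {
  cloud   = "aws"
  machine = "m"
  script  = "python train.py"
  bucket  = "my-org-ml-artifacts" # hypothetical attribute
}

# Variant 2 (hypothetical): no bucket given; the provider would fall back
# to a shared default ".tpi" bucket instead of creating one per task.
resource "iterative_task" "default_bucket" {
  cloud   = "aws"
  machine = "m"
  script  = "python train.py"
}
```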