task bucket usage vs "directory" within a bucket #299

Closed · Tracked by #362 · Fixed by #687
DavidGOrtega opened this issue Nov 25, 2021 · 26 comments

Labels: cloud-aws (Amazon Web Services), discussion (Waiting for team decision), enhancement (New feature or request), p1-important (High priority), resource-task (iterative_task TF resource), storage

Comments

@DavidGOrtega (Contributor) commented Nov 25, 2021

We are generating a bucket with every task. This approach has several drawbacks and a bug:

  • The number of concurrent tasks is limited by the bucket quota
  • Removal is tricky and may not be what the user wants; they might want to keep the tasks for a long time
  • If your quota is depleted, you cannot delete tasks, not even to free space for new ones (this is the bug)

A better approach would be to let the user specify a bucket or, if none is given, create a default .tpi bucket.

DavidGOrtega added the bug (Something isn't working), cloud-aws (Amazon Web Services), p0-critical (Max priority: ASAP) and resource-task (iterative_task TF resource) labels on Nov 25, 2021
@0x2b3bfa0 (Member) commented Nov 25, 2021

Limits

As per the research below, users can have at least 100 concurrent tasks running in any of the supported providers. Upon request, limits can be increased up to 250 per region in the worst case scenario.

Highly parallelizable tasks, like hyperparameter optimization, can use the parallelism argument to launch up to thousands of machines with shared storage, without exceeding any of these limits.
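
For illustration, a hedged sketch of what this might look like in a task definition (the parallelism argument is the one referenced above; the remaining attribute names and values are assumptions, not the exact schema):

resource "iterative_task" "matrix" {
  cloud       = "aws"   # assumed attribute: cloud provider
  machine     = "m"     # assumed attribute: machine size
  parallelism = 100     # one bucket and one orchestration resource
                        # shared by all 100 machines
  script      = <<-END
    #!/bin/bash
    python train.py
  END
}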

Moreover, the number of concurrent tasks is usually bound to the orchestration limits, not to the storage limits.

Storage

aws

By default, you can create up to 100 buckets in each of your AWS accounts. If you need additional buckets, you can increase your account bucket limit to a maximum of 1,000 buckets by submitting a service limit increase. There is no difference in performance whether you use many buckets or just a few.

az

Number of storage accounts per region per subscription, including standard, and premium storage accounts: 250.

gcp

There are no limits on the number of buckets you can create in Google Cloud Storage. — Jeff Terrace, Senior Software Engineer at Google, Google Cloud Storage.

k8s

There is no limit for the number of persistent volume claims beyond the provisioning limits of the underlying storage class.

Orchestration

aws

Auto Scaling groups per Region: 200 [...] To request an increase, use the Auto Scaling Limits form.

az

Maximum number of scale sets in a region: 2,500

gcp

Instance groups: Quota [...] This quota is per project. (empirically, defaults to 100)

k8s

There is no limit for the number of jobs that can be created in a cluster.

@0x2b3bfa0 (Member) commented Nov 25, 2021

Deletion

Tasks have been designed to be deleted as soon as they finish or, rather, as soon as the user realizes that they have finished; e.g. the next morning. Even in moderately big teams, it would be unusual to have more than 100 concurrent tasks when following this approach.1

Teams with more than ~20 data scientists will necessarily have to be backed by specialized DevOps engineers, who already have the ability to increase those limits, and even to overcome them with workarounds.2

Footnotes

  1. Tasks, not machines; as per https://github.com/iterative/terraform-provider-iterative/issues/299#issuecomment-979514335, the latter have much higher limits.

  2. In the worst case, many of these limits can be avoided by using multiple regions, accounts or other provider–specific partitions.

@0x2b3bfa0 (Member) commented Nov 26, 2021

Persistence

Task storage is not meant to replace persistent data/model storage and versioning tools like DVC. It's only meant to share state (e.g. checkpoints) between several short–lived machines.

Even DVC, which is a data–oriented tool, doesn't1 include a mechanism to create (and much less delete) persistent storage resources. This is something that most organizations prefer to manage separately, and often in disparate ways.

Because of the diversity and complexity involved in persistent storage provisioning, it would be risky to include such a feature as part of this project.

Logs

Likewise, task storage is not meant to replace log management platforms.2 For the same reasons as data, log ingestion, storage and monitoring should be performed by means of specialized tools.

Responsibility

Infrastructure management tools have the responsibility of deleting all the resources they create, even when those resources are meant to exist for long periods of time. Creating long–lived resources implies providing a way of deleting them.3

Footnotes

  1. At least, that's what the official documentation says.

  2. For example, Amazon CloudWatch, Azure Monitor or Google Cloud Logging; links courtesy of @casperdcl.

  3. Requiring ClickOps to delete automatically created resources is wrong on so many levels. Big organizations won't run a tool that exhibits such a behavior, and smaller ones shouldn't, even if they were willing to.

0x2b3bfa0 added the discussion (Waiting for team decision) label on Nov 26, 2021
@DavidGOrtega (Contributor, Author) commented Nov 29, 2021

Here we go again!
I'm having serious issues today with this. I cannot even imagine people adopting this after the issues they might have. Right now I do not have any more buckets available, because I use this in the CI and we do not have an effective destroy in place.
I have requested more buckets from AWS and I'm still waiting for approval! So I'm totally locked.
Seriously, I won't use something like this.

@DavidGOrtega (Contributor, Author)

Even in moderately big teams, it would be unusual to have more than 100 concurrent tasks when following this approach.

Who says? I can show you cases where 100 is super small. As we have been discussing, a team with 3 models and a matrix can launch more than those 100 in a breeze, and we do not have a perfect destroy in the CI yet.

@0x2b3bfa0 (Member)

As stated on #299 (comment), matrix use cases would benefit from parallelism and would only consume a single bucket and a single orchestration resource for all the created machines. See 0x2b3bfa0/cml-use-case-matrix-task for a rudimentary example.

@0x2b3bfa0 (Member) commented Nov 29, 2021

Unfortunately, quota requests take time. In the meantime, #314 should be addressed in order to prevent the bug or, at least, to mitigate it. Let's demote this discussion to important after extracting the critical issue.


0x2b3bfa0 added the p1-important (High priority) label and removed the p0-critical (Max priority: ASAP) label on Nov 29, 2021
@dacbd (Contributor) commented Nov 29, 2021

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account: buckets to use, security groups/firewall rules, instance role/service account, image/AMI, etc.

Many of these are already set up and created by organizations, and this tool is meant to help manage short-lived instances. It should manage as few "infrastructure" pieces as possible while still being easy to use, with a low barrier of entry for more agile users.

@DavidGOrtega (Contributor, Author) commented Nov 29, 2021

matrix use cases would benefit of parallelism and would only consume a single bucket

Can you please write down an example of what it looks like?

I do not think that 0x2b3bfa0/cml-use-case-matrix-task resembles a useful or realistic case, and I do not fully understand the usage of task there. We are using the task to launch the runners; however, this approach is not right yet (the data sync is totally useless). To make such a case interesting and useful, the runners should recover the previous data folder when they start after a spot termination, and that does not happen, since the workdir changes on every runner startup.
It also assumes that the training job will end before the workflow timeout.


@0x2b3bfa0 (Member) commented Dec 3, 2021

Architecture

Tasks have been designed to be completely ephemeral, and able to run in a pristine cloud account without additional configuration. Treating storage1 as an ephemeral resource may be an unusual choice, but there is no other way of avoiding a separate installation process.

If we wanted to use a shared bucket for all the tasks, it would make sense2 to embrace the official providers instead of writing our own, and just publish a module meant to deploy persistent resources — like an object storage bucket or an instance orchestrator — to be used by every task. GitHub recommends something similar, but it's still overcomplicated.

Designing a “new” task orchestrator out of cloud primitives (virtual machine orchestrator, queue, log aggregator, object storage, et cetera) would imply reinventing the heptagonal wheel as previously stated, and ultimately lead us to consider nodeless Kubernetes solutions based on Elotl Kip (source code) and Cluster API spot instances.

Footnotes

  1. See https://github.com/iterative/cml/issues/561#issuecomment-871019350 for context on the choice of object storage over other types.

  2. The current implementation still [ab]uses Terraform, ignoring the official Provider Design Principles in atrocious ways.

@0x2b3bfa0 (Member) commented Dec 3, 2021

I'm inclined to think that it's fine to use ephemeral buckets to cache data and keep artifacts until users “harvest” them. Still, treating object storage buckets as an ephemeral resource looks like a pretty unusual practice, as @dmpetrov pointed out.

Pinging @duijf, @JIoJIaJIu and @shcheklein for a second sixth opinion as requested. It would be awesome to have more feedback on the possible alternatives:

  1. Use ephemeral buckets for every task and delete them as soon as users harvest the results
  2. Require users to provide an existing bucket and store artifacts in separate “directories” for each task

There might be other alternatives I overlooked, though.

@dmpetrov (Member) commented Dec 3, 2021

  1. Use ephemeral buckets for every task and delete them as soon as users harvest the results

Please note, an ephemeral bucket means an actual bucket (not a key/path in an existing bucket). An ephemeral bucket is supposed to be created in the "root" with a temporary name like s3://xpd-my-test-30g0bew1pcghg and deleted once the job is done.

  2. Require users to provide an existing bucket and store artifacts in separate “directories” for each task

This might be a user-specified path within an existing bucket, like s3://iterative-ai/ml/segment/dmpetrov/, with the directory name xpd-my-test-30g0bew1pcghg.

@dmpetrov (Member) commented Dec 3, 2021

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account

@dacbd would you prefer to have an ephemeral / temporary bucket like s3://xpd-my-test-30g0bew1pcghg for each task or a temp directory in a user specified path like s3://iterative-ai/ml/segment/xpd-my-test-30g0bew1pcghg?

A separate question - would you prefer to keep output/state/logs of the task after the task is done or failed or remove the bucket or directory xpd-my-test-30g0bew1pcghg?

@duijf commented Dec 3, 2021

My 2 cents:

  • Try to play nice with existing infrastructure.
  • Directories within a bucket sounds like the best way to go.

More background below :)

Playing nice with existing infra

  • Medium - large orgs / teams probably already have data infrastructure + policies, etc. in place. Teams may have compliance requirements to document / track the purposes of their buckets, keep access logs, etc.
  • Mature ops teams already have tools to deal with these things. They already have these access control policies codified in CloudFormation / Terraform / Pulumi / etc. They probably don't want to move them to yet another tool. They probably want to define a role + limited set of buckets using the tools they already use and use TPI for the stuff it's good at.
  • If TPI were to "own" the entire resource creation process/lifecycle, be aware that you are going to get a bunch of requests to expose different things that people care about, which can significantly increase the scope of the project. If you make it work with existing cloud resources, then you can sidestep this problem and say "you can always create resources manually".

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account

Very much agreed with this. Hopefully being able to specify existing resources isn't mutually exclusive with some sort of onboarding experience where TPI can abstract some stuff for users that are new to this.

Buckets + quotas

The quota problems look pretty serious to me. Even if you can boost your quotas, I wouldn't bet that people would be happy to give up significant portions of them just for TPI.

From the outside, it looks like a pretty arbitrary decision to have a bucket per task, which also adds a lot of extra moving parts + papercuts. I would seriously consider going for directories in a bucket that already exists.

Cleanup

A separate question - would you prefer to keep output/state/logs of the task after the task is done or failed or remove the bucket or directory xpd-my-test-30g0bew1pcghg?

"Clean up everything" seems like a good default, but this should probably be configurable. "Always clean up", "Clean up on success", "Never clean up" all make sense to me. (Not sure if there is a use case for "Only clean up on failure".)
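
A purely hypothetical sketch of how such a setting could surface in configuration (the cleanup attribute and its values do not exist in TPI; they are invented here purely for illustration):

resource "iterative_task" "example" {
  # ... other task attributes ...

  # Hypothetical attribute: "always" | "on_success" | "never"
  cleanup = "on_success"
}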

@dmpetrov (Member) commented Dec 3, 2021

  • Try to play nice with existing infrastructure.
  • Directories within a bucket sounds like the best way to go.

I have the same feeling. My understanding: you create and delete a bucket when you provision a new resource, like a database or a new system deployment. An experiment / train-task is not a new resource; it is just a run.

The Terraform backend might be a good analogy: it uses an existing path, and it does not destroy the bucket:

terraform {
  backend "s3" {
    bucket         = "terraform-up-and-running-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-2"
  }
}

@dacbd (Contributor) commented Dec 3, 2021

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account

@dacbd would you prefer to have an ephemeral / temporary bucket like s3://xpd-my-test-30g0bew1pcghg for each task or a temp directory in a user specified path like s3://iterative-ai/ml/segment/xpd-my-test-30g0bew1pcghg?

A separate question - would you prefer to keep output/state/logs of the task after the task is done or failed or remove the bucket or directory xpd-my-test-30g0bew1pcghg?

@dmpetrov I think that an ephemeral bucket is a fine default, but, given a path, it should use a directory at the base of the path:
given s3://reducedredunancy/bucket it would use s3://reducedredunancy/bucket/xpd-my-test-30g0bew1pcghg
given s3://30daylifecycle/policy/bucket it would use s3://30daylifecycle/policy/bucket/xpd-my-test-30g0bew1pcghg

and I'll reiterate @duijf's point:

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account

Very much agreed with this. Hopefully being able to specify existing resources isn't mutually exclusive with some sort of onboarding experience where TPI can abstract some stuff for users that are new to this.

@0x2b3bfa0 (Member)

Thank you very much for the thorough feedback! ❤️

A good compromise would be adding a storage attribute to the task resource with the following behavior:

  1. When unset, create/delete ephemeral buckets as we do now
  2. When set, use the given prefix to create/delete “directories”

Examples

Cloud providers

storage = "bucket/path/prefix" to create “directories” on the specified bucket and (preferably) under the specified prefix.

Kubernetes

storage = "azurefile:30" to create a Persistent Volume Claim with a size of 30 GB (if applicable) from the azurefile Storage Class.
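
A hedged sketch of how the cloud provider variant might look in a task definition (storage is the attribute proposed above; the remaining attribute names are assumptions for illustration):

resource "iterative_task" "example" {
  cloud   = "aws"   # assumed attribute
  machine = "m"     # assumed attribute

  # Proposed: reuse an existing, externally managed bucket and create
  # per-task "directories" under the given prefix.
  storage = "my-team-bucket/experiments"

  script = <<-END
    #!/bin/bash
    python train.py
  END
}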

Limitations

  • Any resource specified through the storage attribute should already exist; i.e. be externally managed.
  • Leaving data in the cloud after destroying a task would only be possible when specifying the storage attribute.

Still, leaving data in the cloud after destroying the task comes with some challenges. Not sure if we should support this persistence use case out of the box.

0x2b3bfa0 added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Dec 7, 2021
@0x2b3bfa0 (Member)

Probably related to the API proposal on #307 (comment): we now have enough storage–related attributes to consider grouping them all into a single block.

@dacbd (Contributor) commented Dec 7, 2021

Still, leaving data in the cloud after destroying the task comes with some challenges. Not sure if we should support this persistence use case out of the box.

I think that when using a predefined bucket, it would be easy enough for the user to ensure persistence before running terraform destroy; or, since task tears down the instance when execution is complete, they can persist data by simply not running terraform destroy.

@dmpetrov (Member) commented Dec 7, 2021

Still, leaving data in the cloud after destroying the task comes with some challenges. Not sure if we should support this persistence use case out of the box.

🤔 To my mind, it's quite the opposite.

I understand where the motivation comes from: TF destroy should clean up allocated resources. But I'm not sure how this applies to our use case. I consider the logs, data and config files that TPI copies to the cloud to be logs. And removing logs looks like a rather strange practice.

TF itself also does not follow its own rules of destroying resources. See the example with the TF backend in #299 (comment).

It feels like we are introducing artificial rules here 🙂 I'd suggest providing maximum flexibility for users and not destroying logs until users explicitly ask for it (in config).

@casperdcl (Contributor)

Related: terraform import existing resources to avoid the problem of creating & destroying things ourselves.
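
A rough sketch of that idea, assuming a pre-existing bucket (the resource and bucket names are placeholders):

# Declare the pre-existing bucket instead of letting Terraform create it,
# then adopt it into state with:
#   terraform import aws_s3_bucket.shared my-existing-bucket
resource "aws_s3_bucket" "shared" {
  bucket = "my-existing-bucket"

  lifecycle {
    prevent_destroy = true  # any plan that would delete this bucket fails
  }
}

After the import, Terraform tracks the bucket in state without ever having created it, so per-task bucket creation (and its quota) is avoided entirely.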
