
[New Hub] University of Washington - NASA SnowEx Hackweek 2022 #1309

Closed · 6 of 7 tasks · damianavila opened this issue May 13, 2022 · 33 comments

@damianavila (Contributor) commented May 13, 2022

Hub Description

The hub has the following needs:

Community Representative(s)

@scottyhq

Important dates

  • Required start date: Jun 11th, 2022
  • Target start date: Jun 1st, 2022
  • Any important dates for usage:
    • 2i2c engineer on hand for same-day question answering and troubleshooting during the week-long main event, 8am-4pm Pacific Time (July 11 - July 15)
    • Hub will remain accessible through September 1, 2022, or until the budget runs out, whichever comes first

Hub Authentication Type

GitHub Authentication (e.g., @MyGitHubHandle)

Hub logo information

  • URL to Hub Image: {{ URL HERE }}
  • URL for Image Link: {{ URL HERE }}

Hub user image

Extra features you'd like to enable

  • Specific cloud provider or datacenter: AWS
  • Dedicated Kubernetes cluster
  • Scalable Dask Cluster

Other relevant information

GPU support on AWS is not yet available in our infrastructure and will need to be developed for this hub.

Hub URL

TBD.TBD.2i2c.cloud

Hub Type

daskhub

Tasks to deploy the hub

  • GPU setup on eksctl
  • S3 scratch bucket setup
  • GitHub Teams based authentication setup
  • Deploy on existing uwhackweeks cluster
@yuvipanda (Member)

I've updated the issue with the TODOs that need to happen.

I'm going to use the existing uwhackweeks infra to prototype GPU support as well as the scratch bucket support.

@yuvipanda (Member)

I've asked for an increase in quota on the uwhackweeks AWS account for GPU instances, and can proceed once that comes through.

I've also asked @scottyhq for a credits voucher to kickstart a new account.

@yuvipanda (Member)

We're going to use a new account so that costs for this are separate from costs for the current uwhackweeks setup.

@damianavila (Contributor, Author)

I've asked for an increase in quota on the uwhackweeks AWS account for GPU instances, and can proceed once that comes through.

Can we list here the specific quotas you requested an increase for (so we can later document the specific requirement and process)? Thanks!

@damianavila damianavila moved this from Ready to work to In progress in DEPRECATED Engineering and Product Backlog May 16, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 17, 2022
- Document howto set up GPUs on AWS
- Temporarily add a GPU profile to the uwhackweeks
  hub, until we setup an account for the snowex hackweek

Ref 2i2c-org#1309
@yuvipanda (Member)

@damianavila documented as part of #1314

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 23, 2022
- Bump up AWS provider version, as there had been a few deprecations
  in the IAM resources
- Mimic the GCP setup as much as possible

Ref 2i2c-org#1309
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 24, 2022
- Bump up AWS provider version, as there had been a few deprecations
  in the IAM resources
- Mimic the GCP setup as much as possible
- Write some docs on how to enable this for S3
- Setup IRSA with Terraform so users can be granted IAM roles
- Use the uwhackweeks to test

Ref 2i2c-org#1309
@scottyhq (Contributor)

Hey all, thanks for getting this going! I think we're set now with credits for a new account for this. A couple more details and questions below:

Hub logo information: https://snowex.hackweek.io
URL to Hub Image: https://github.com/snowex-hackweek/website2022/raw/main/book/logo.png

For the URL, we currently have uwhackweeks.2i2c.cloud mapping to the ICESat-2 Hackweek Hub. Coincidentally, we are shutting that down June 1, right as this new hub becomes active, so I think we can just use the same URL. In the future, it will be useful to use subdomains in case of overlapping events: snowex.uwhackweeks.2i2c.cloud?

For "Community representatives" we can add @jomey! who is going to be the main point of contact during the week of the event.

@yuvipanda (Member)

@scottyhq ah, so how about this:

  1. We apply the new credits to the same cluster.
  2. We bring down the old hub (at uwhackweeks.2i2c.cloud) and delete all the home directories
  3. We bring up new hub at snowex.uwhackweeks.2i2c.cloud, with fresh home directories and everything
  4. This will allow us to easily see the cloud cost difference, as there will be no overlap
  5. This will make setup easier, as we won't need a new account. We can also continue using the credits we already have, and most importantly it'll make it easier to get more GPU quota.

Does this sound acceptable to you?

@scottyhq (Contributor)

Does this sound acceptable to you?

Sounds great, with the condition that we do this transition after June 1 (we told people that's the end date for the ICESat-2 hub!)

@yuvipanda (Member)

@scottyhq sounds good! I'll set up the snowex hub first, and we can separately decomm the existing hub later.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 26, 2022
- Moves current hackweeks hub (which is really icesat hackweek hub)
  config out of common and into staging / prod.yaml
- Add new config for snowex hackweek
- Add scratch bucket for snowex hackweek

Ref 2i2c-org#1309
@yuvipanda (Member)

@scottyhq ok I've set up https://snowex.uwhackweeks.2i2c.cloud now! Take it for a spin?

I've also set up what is needed for https://snowex.uwhackweeks.2i2c.cloud to work - if you install gh-scoped-creds in your image, you should be able to securely push to GitHub too!
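
For hackweek participants, using the scratch bucket from a notebook looks roughly like the sketch below. It assumes the hub exposes the bucket through a `SCRATCH_BUCKET` environment variable (the usual 2i2c convention) and that `s3fs` is installed in the image; the file name is just an example.

```python
# Minimal sketch: writing to the per-user scratch bucket from inside the hub.
# Assumes the hub sets a SCRATCH_BUCKET env var (2i2c convention) and that the
# user pod has been granted permissions on the bucket (via IRSA).
import os
import s3fs

scratch = os.environ["SCRATCH_BUCKET"]  # e.g. "s3://<bucket>/<username>"
fs = s3fs.S3FileSystem()  # credentials come from the pod's IAM role

with fs.open(f"{scratch}/hello.txt", "w") as f:
    f.write("scratch bucket is working\n")

print(fs.ls(scratch))
```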

@yuvipanda (Member)

@scottyhq you'd also need to give access to the JupyterHub (https://infrastructure.2i2c.org/en/latest/howto/configure/auth-management.html?highlight=github#follow-up-github-organization-administrators-must-grant-access) for it to allow logins based on GitHub teams.
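
For context, on the hub side the team-based access boils down to OAuthenticator configuration along these lines. This is only a sketch (not the actual 2i2c deployment config), and the org and team names below are placeholders:

```python
# Sketch of jupyterhub_config.py for GitHub-teams-based login, assuming a
# recent oauthenticator release. Org/team names are placeholders.
from oauthenticator.github import GitHubOAuthenticator

c.JupyterHub.authenticator_class = GitHubOAuthenticator

# "org" grants everyone in the org access; "org:team" restricts to one team.
c.GitHubOAuthenticator.allowed_organizations = {"snowex-hackweek:participants"}

# Reading team membership requires the read:org scope, and an org admin must
# grant the OAuth app access to the org (the step linked above).
c.GitHubOAuthenticator.scope = ["read:user", "read:org"]
```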

@yuvipanda (Member)

@scottyhq ok, I think this is completely set up on https://snowex.uwhackweeks.2i2c.cloud/. Try it and let me know if it works out ok?

I opened #1336 for decommissioning the existing hub.

@yuvipanda (Member)

@scottyhq also, we only have quota for 2 concurrent GPU instances right now. Can you tell me how many you would want to support, and I'll ask for a quota increase right away?

@scottyhq (Contributor)

Awesome! Will kick the tires today.

Can you tell me how many you would want to support, and I'll ask for a quota increase right away?

Ideally we want to support up to 100 simultaneous users, but we'll make do with whatever we get. A minimum viable number would be ~20 (where one person per group in the hackweek has guaranteed access to a GPU node).

Repository owner moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog May 26, 2022
@yuvipanda yuvipanda reopened this May 26, 2022
@yuvipanda (Member)

Accidental close!

@scottyhq ok I'll ask! Note that the GPU nodes will only have 4 CPUs & about 55 GB of RAM. Is that ok? And I'm getting K80 GPUs.

@scottyhq (Contributor)

CPU and RAM are fine; generally I just assume the instance resource ratios are optimized in some way for most workloads. Not too picky about GPU type, but it would be nice to have more modern options (T4, A100, V100), and perhaps a second type would also help with increased quota.

@damianavila damianavila moved this from Complete to In progress in DEPRECATED Engineering and Product Backlog May 30, 2022
@damianavila (Contributor, Author)

@yuvipanda, do you have any news about the quota increase you have requested?

@yuvipanda (Member)

@damianavila @scottyhq so I heard back, and our quota increase was approved only up to 32 - so that gives us just 8 GPUs :( The quota is for 'P and G' types together, which covers all the GPU types - so we can't split it up among multiple types either. I also asked if paying for premium support would increase the chances of the quota being granted, and was told it would not.

The specific response is:


Hello,

Thank you for your patience. We partially fulfilled your quota increase request. Your new quota for All P and G instances is 32.

We can reassess a higher quota increase at a later stage. In the meantime, consider alternative instance types, or spreading your instances across AWS Regions.

For a full list of our alternative instance types, see:
http://aws.amazon.com/ec2/instance-types/

To avoid processing delays, submit quota increase requests in the Service Quotas console:
https://console.aws.amazon.com/servicequotas/home

Feel free to ask if you have any other questions.

Have an awesome week ahead.

Please let us know if we helped resolve your issue:

This is a bit frustrating, as I had explained to them why we wanted the quota increase we did.
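
For the record, a request like this can also be filed through the Service Quotas API rather than the console. The sketch below is illustrative only: it matches the quota by name instead of hard-coding a quota code, and 400 is simply the value originally asked for.

```python
# Illustrative sketch: requesting an EC2 vCPU quota increase via the Service
# Quotas API instead of the console. The name filter and desired value are
# assumptions; AWS reported our quota as "All P and G instances".
import boto3

sq = boto3.client("service-quotas", region_name="us-west-2")

# Find the EC2 quota whose name mentions P instances.
paginator = sq.get_paginator("list_service_quotas")
quota = next(
    q
    for page in paginator.paginate(ServiceCode="ec2")
    for q in page["Quotas"]
    if "P instances" in q["QuotaName"]
)
print(quota["QuotaName"], quota["Value"])

# File the increase request (400 vCPUs was the amount originally requested).
resp = sq.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode=quota["QuotaCode"],
    DesiredValue=400.0,
)
print(resp["RequestedQuota"]["Status"])
```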

@yuvipanda (Member)

Multiple regions don't really work for us since we want to stay in us-west-2, and our home directories are there. We could maybe ask for a spot instance quota increase too?

@damianavila (Contributor, Author)

We could ask for a spot instance quota increase too maybe?

But would that help with the imposed limit?
I guess they are trying to "force" us to use a bigger P instance?

@yuvipanda (Member)

@damianavila the quota is for CPU count, so bigger instances won't help.

@damianavila (Contributor, Author)

OK... let me see if I understand this...
We currently have a p2.xlarge instance, where the original quota was 8 CPUs, which translates to 8/4 = 2 GPUs.

[Screenshot: table of p2 instance sizes with their vCPU and GPU counts]

You requested a raise and they gave you 32 CPUs, which translates to 32/4 = 8 GPUs.
That is somewhat expected if you look at the available instances in the above table.
They want to avoid multiple tiny boxes and are, instead, "forcing" you through the quota to use a bigger instance.

If you use the p2.8xlarge, you will have the 32 CPUs and 8 GPUs, but your quota will most likely (I am supposing that will be the case, I might be wrong; it is something to ask their support team) be raised beyond that, I presume to somewhere between 32 and 64 (with a max of 64)... because at 64 CPUs they will "force" you again to use the p2.16xlarge.

Does it make sense, or am I talking nonsense 😉?

@yuvipanda (Member)

The ratio of CPUs to GPUs is 4 for all p2 instances, so we can only get 8 GPUs with a 32 CPU quota regardless of the size of the instances we use. In other words, we can get 32 total CPUs, which will provide us with at most 8 GPUs in whatever configuration.
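
Spelling out that arithmetic (the vCPU and GPU counts are the published p2 sizes):

```python
# Worked example: a 32-vCPU quota caps us at 8 GPUs for any mix of p2 sizes,
# because every p2 size ships 4 vCPUs per GPU.
p2_sizes = {
    "p2.xlarge": {"vcpus": 4, "gpus": 1},
    "p2.8xlarge": {"vcpus": 32, "gpus": 8},
    "p2.16xlarge": {"vcpus": 64, "gpus": 16},
}

quota_vcpus = 32
for name, spec in p2_sizes.items():
    max_instances = quota_vcpus // spec["vcpus"]
    print(f"{name}: {max_instances} instance(s), {max_instances * spec['gpus']} GPU(s)")
# p2.xlarge: 8 instance(s), 8 GPU(s)
# p2.8xlarge: 1 instance(s), 8 GPU(s)
# p2.16xlarge: 0 instance(s), 0 GPU(s)  <- can't even launch one under the quota
```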

@damianavila (Contributor, Author)

OK... but that is assuming the 32 quota is fixed across instance sizes... what I am saying here is that the quota may be fixed for the smaller instances BUT I think they might raise the quota limit if you use a bigger instance.
For instance, if you use the larger instance with 64 CPUs, a fixed quota of 32 is total nonsense! Why would you pay for an instance where you can only use half of the available GPUs?

@yuvipanda (Member)

@damianavila the quota is fixed across the instance family - it covers all P and G type instances put together. With a 32 quota, trying to launch an instance with 64 CPUs will just fail with an error message about not having enough quota.

@damianavila (Contributor, Author)

With a 32 quota, trying to launch an instance with 64 CPUs will just fail with an error message about not having enough quota.

But you agree with me that this is total nonsense, right? Or am I missing something else?

@yuvipanda (Member)

@damianavila that they gave us only a 32 CPU quota (given I had asked for 400) is definitely nonsense, and I agree! But I'm not sure I fully understand what the suggested next step is. In particular, I don't think it matters which instance sizes we try to use - so I'm a little confused there!

@damianavila (Contributor, Author) commented Jun 2, 2022

@yuvipanda, I would probably further ask AWS support IF changing to a bigger instance actually gives us the opportunity to be granted a higher quota/limit (more than 32).
To be clear, I am not suggesting we go with a bigger instance until we have explicit confirmation from AWS that "they will raise the quota to more than 32 CPUs if we run with a bigger instance".
Sorry if I am not being clear enough in my writing... and feel free to ignore this if you think it is a dead-end road.

@scottyhq (Contributor) commented Jun 2, 2022

Frustrating that the quota is so low! I'd reply with something like the following:

We've looked into alternative instance types, but our use case requires NVIDIA GPUs and Intel CPUs (https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html) for software compatibility. The NASA datasets we are working with are very large and hosted in us-west-2, so it is neither cost-effective nor efficient to launch instances in multiple regions (https://www.earthdata.nasa.gov/learn/articles/eosdis-data-cloud-user-requirements).

We are provisioning servers for hundreds of scientists for research and educational events, but with the current quota only 8 of them can have access to a GPU at a time. Also, in order to compare the performance of various instance options, we need at least enough quota to run the maximum size of P2, P3, and G4 instances, and preferably more in order to run instances simultaneously, which is why we again request a quota increase to 256 vCPUs.
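
As a back-of-the-envelope check on the 256 figure (instance specs per the EC2 instance type pages; whether this is exactly how 256 was arrived at is an assumption):

```python
# Rough check that 256 vCPUs is about what it takes to run the largest size
# in each of the three families mentioned in the request.
largest = {
    "p2.16xlarge": 64,     # 16x K80
    "p3dn.24xlarge": 96,   # 8x V100
    "g4dn.metal": 96,      # 8x T4
}
print(sum(largest.values()))  # 256 -> enough to run one of each at maximum size
```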

@yuvipanda (Member)

@damianavila I can ask them specifically, but based on my experience so far I think it's a dead end. They generally don't care how many instances you spread the CPUs across. But if your experience with AWS has been different, I can ask them!

@damianavila (Contributor, Author)

But if your experience with AWS has been different, I can ask them!

In general, I did not have a different experience than you with their support.

@yuvipanda (Member)

We got bigger GPU quota thanks to @cgentemann!

I think this issue can be closed now.

@choldgraf (Member)

Follow-up: I opened up the following issue to discuss these issues around new AWS organizations and resource limits/quotas:
