
[New Hub] University of Washington - NASA SnowEx Hackweek 2022 #1309

Closed · 6 of 7 tasks · damianavila opened this issue May 13, 2022 · 33 comments

@damianavila (Contributor) commented May 13, 2022

Hub Description

The hub has the following needs:

Community Representative(s)

@scottyhq

Important dates

  • Required start date: Jun 11th, 2022
  • Target start date: Jun 1st, 2022
  • Any important dates for usage:
    • 2i2c engineer on hand for same-day question answering and troubleshooting during the week-long main event, 8am-4pm Pacific Time (July 11 - July 15)
    • Hub will remain accessible through September 1, 2022, or until the budget runs out, whichever comes first

Hub Authentication Type

GitHub Authentication (e.g., @MyGitHubHandle)

Hub logo information

  • URL to Hub Image: {{ URL HERE }}
  • URL for Image Link: {{ URL HERE }}

Hub user image

Extra features you'd like to enable

  • Specific cloud provider or datacenter: AWS
  • Dedicated Kubernetes cluster
  • Scalable Dask Cluster

Other relevant information

GPU support on AWS is not yet available in our infrastructure and will need to be developed for this hub.

Hub URL

TBD.TBD.2i2c.cloud

Hub Type

daskhub

Tasks to deploy the hub

  • GPU setup on eksctl
  • S3 scratch bucket setup
  • GitHub Teams based authentication setup
  • Deploy on existing uwhackweeks cluster
@yuvipanda (Member)

I've updated the issue with the TODOs that need to happen.

I'm going to use the existing uwhackweeks infra to prototype GPU support as well as the scratch bucket support.

@yuvipanda (Member)

I've asked for an increase in quota on the uwhackweeks AWS account for GPU instances, and can proceed once that comes through.

I've also asked @scottyhq for a credits voucher to kickstart a new account.

@yuvipanda (Member)

We're going to use a new account so that costs for this are separate from costs for the current uwhackweeks setup.

@damianavila (Contributor, Author)

I've asked for an increase in quota on the uwhackweeks AWS account for GPU instances, and can proceed once that comes through.

Can we list here the specific quotas you requested an increase for (so we can later document the specific requirement and process)? Thanks!

@damianavila damianavila moved this from Ready to work to In progress in DEPRECATED Engineering and Product Backlog May 16, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 17, 2022
- Document howto set up GPUs on AWS
- Temporarily add a GPU profile to the uwhackweeks
  hub, until we setup an account for the snowex hackweek

Ref 2i2c-org#1309
@yuvipanda (Member)

@damianavila documented as part of #1314

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 23, 2022
- Bump up AWS provider version, as there had been a few deprecations
  in the IAM resources
- Mimic the GCP setup as much as possible

Ref 2i2c-org#1309
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 24, 2022
- Bump up AWS provider version, as there had been a few deprecations
  in the IAM resources
- Mimic the GCP setup as much as possible
- Write some docs on how to enable this for S3
- Setup IRSA with Terraform so users can be granted IAM roles
- Use the uwhackweeks to test

Ref 2i2c-org#1309
@scottyhq (Contributor)

Hey all, thanks for getting this going! I think we're set now with credits for a new account for this. A couple more details and questions below:

Hub logo information: https://snowex.hackweek.io
URL to Hub Image: https://github.com/snowex-hackweek/website2022/raw/main/book/logo.png

For the URL, we currently have uwhackweeks.2i2c.cloud mapping to the ICESat-2 Hackweek Hub. Coincidentally, we are shutting that down June 1, right as this new hub becomes active, so I think we can just use the same URL. In the future, it will be useful to use subdomains in case of overlapping events: snowex.uwhackweeks.2i2c.cloud?

For "Community representatives" we can add @jomey! who is going to be the main point of contact during the week of the event.

@yuvipanda (Member)

@scottyhq ah, so how about this:

  1. We apply the new credits to the same cluster.
  2. We bring down the old hub (at uwhackweeks.2i2c.cloud) and delete all the home directories
  3. We bring up new hub at snowex.uwhackweeks.2i2c.cloud, with fresh home directories and everything
  4. This will allow us to easily see the cloud cost difference, as there will be no overlap
  5. This will make setup easier, as we won't need a new account. We can also continue using the credits we already have, and most importantly it'll make it easier to get more GPU quota.

Does this sound acceptable to you?

@scottyhq (Contributor)

Does this sound acceptable to you?

Sounds great, with the condition that we do this transition after June 1 (we told people that's the end date for the ICESat-2 hub!)

@yuvipanda (Member)

@scottyhq sounds good! I'll set up the snowex hub first, and we can separately decomm the existing hub later.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 26, 2022
- Moves current hackweeks hub (which is really icesat hackweek hub)
  config out of common and into staging / prod.yaml
- Add new config for snowex hackweek
- Add scratch bucket for snowex hackweek

Ref 2i2c-org#1309
@yuvipanda (Member)

@scottyhq ok I've set up https://snowex.uwhackweeks.2i2c.cloud now! Take it for a spin?

I've also set up what is needed for https://snowex.uwhackweeks.2i2c.cloud to work - if you install gh-scoped-creds in your image, you should be able to securely push to GitHub too!
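
For hackweek participants, using the scratch bucket from a notebook looks roughly like the sketch below. It assumes the hub exposes the bucket through a `SCRATCH_BUCKET` environment variable (the usual 2i2c convention) and that `s3fs` is installed in the image; the file name is just an example.

```python
# Minimal sketch: writing to the per-user scratch bucket from inside the hub.
# Assumes the hub sets a SCRATCH_BUCKET env var (2i2c convention) and that the
# user pod has been granted permissions on the bucket (via IRSA).
import os
import s3fs

scratch = os.environ["SCRATCH_BUCKET"]  # e.g. "s3://<bucket>/<username>"
fs = s3fs.S3FileSystem()  # credentials come from the pod's IAM role

with fs.open(f"{scratch}/hello.txt", "w") as f:
    f.write("scratch bucket is working\n")

print(fs.ls(scratch))
```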

@yuvipanda (Member)

@scottyhq you'd also need to give access to the JupyterHub (https://infrastructure.2i2c.org/en/latest/howto/configure/auth-management.html?highlight=github#follow-up-github-organization-administrators-must-grant-access) for it to allow logins based on GitHub teams.
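
For context, on the hub side the team-based access boils down to OAuthenticator configuration along these lines. This is only a sketch (not the actual 2i2c deployment config), and the org and team names below are placeholders:

```python
# Sketch of jupyterhub_config.py for GitHub-teams-based login, assuming a
# recent oauthenticator release. Org/team names are placeholders.
from oauthenticator.github import GitHubOAuthenticator

c.JupyterHub.authenticator_class = GitHubOAuthenticator

# "org" grants everyone in the org access; "org:team" restricts to one team.
c.GitHubOAuthenticator.allowed_organizations = {"snowex-hackweek:participants"}

# Reading team membership requires the read:org scope, and an org admin must
# grant the OAuth app access to the org (the step linked above).
c.GitHubOAuthenticator.scope = ["read:user", "read:org"]
```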

@yuvipanda (Member)

@scottyhq ok, I think this is completely set up on https://snowex.uwhackweeks.2i2c.cloud/. Try it and let me know if it works out ok?

I opened #1336 for decommissioning the existing hub.

@yuvipanda (Member)

@scottyhq also, we only have quota for 2 concurrent GPU instances right now. Can you tell me how many you would want to support, and I'll ask for a quota increase right away?

@scottyhq (Contributor)

Awesome! Will kick the tires today.

Can you tell me how many you would want to support, and I'll ask for a quota increase right away?

Ideally we want to support up to 100 simultaneous users, but we'll make do with whatever we get. A minimum viable number would be ~20 (where one person per group in the hackweek has guaranteed access to a GPU node).

Repository owner moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog May 26, 2022
@yuvipanda yuvipanda reopened this May 26, 2022
@yuvipanda (Member)

Accidental close!

@scottyhq ok I'll ask! Note that the GPU nodes will only have 4 CPUs & about 55 GB of RAM. Is that ok? And I'm getting K80 GPUs.

@scottyhq (Contributor)

CPU and RAM are fine; generally I just assume the instance resource ratios are optimized in some way for most workloads. Not too picky about GPU type, but it would be nice to have more modern options (T4, A100, V100), and perhaps a second type would also help with increased quota.

@damianavila damianavila moved this from Complete to In progress in DEPRECATED Engineering and Product Backlog May 30, 2022
@damianavila (Contributor, Author)

@yuvipanda, do you have any news about the quota increase you have requested?

@yuvipanda (Member)

@damianavila @scottyhq so I heard back, and our quota increase was approved only up to 32 - so that gives us just 8 GPUs :( The quota is for 'P and G' types together, which covers all the GPU types - so we can't split it up among multiple types either. I also asked if paying for premium support would increase the chances of the quota being granted, and was told it would not.

The specific response is:


Hello,

Thank you for your patience. We partially fulfilled your quota increase request. Your new quota for All P and G instances is 32.

We can reassess a higher quota increase at a later stage. In the meantime, consider alternative instance types, or spreading your instances across AWS Regions.

For a full list of our alternative instance types, see:
http://aws.amazon.com/ec2/instance-types/

To avoid processing delays, submit quota increase requests in the Service Quotas console:
https://console.aws.amazon.com/servicequotas/home

Feel free to ask if you have any other questions.

Have an awesome week ahead.

Please let us know if we helped resolve your issue:

This is a bit frustrating, as I had explained to them why we wanted the quota increase we did.
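
For the record, a request like this can also be filed through the Service Quotas API rather than the console. The sketch below is illustrative only: it matches the quota by name instead of hard-coding a quota code, and 400 is simply the value originally asked for.

```python
# Illustrative sketch: requesting an EC2 vCPU quota increase via the Service
# Quotas API instead of the console. The name filter and desired value are
# assumptions; AWS reported our quota as "All P and G instances".
import boto3

sq = boto3.client("service-quotas", region_name="us-west-2")

# Find the EC2 quota whose name mentions P instances.
paginator = sq.get_paginator("list_service_quotas")
quota = next(
    q
    for page in paginator.paginate(ServiceCode="ec2")
    for q in page["Quotas"]
    if "P instances" in q["QuotaName"]
)
print(quota["QuotaName"], quota["Value"])

# File the increase request (400 vCPUs was the amount originally requested).
resp = sq.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode=quota["QuotaCode"],
    DesiredValue=400.0,
)
print(resp["RequestedQuota"]["Status"])
```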

@yuvipanda (Member)

Multiple regions don't really work for us since we want to stay in us-west-2, and our home directories are there. We could maybe ask for a spot instance quota increase too?

@damianavila (Contributor, Author)

We could ask for a spot instance quota increase too maybe?

But would that help with the imposed limit?
I guess they are trying to "force" us to use a bigger P instance?

@yuvipanda (Member)

@damianavila the quota is for CPU count, so bigger instances won't help.

@damianavila (Contributor, Author)

OK... let me see if I understand this...
We currently have a p2.xlarge instance, where the original quota was 8 CPUs, which translates to 8/4 = 2 GPUs.

[Screenshot: table of p2 instance sizes with their vCPU and GPU counts]

You requested a raise and they gave you 32 CPUs, which translates to 32/4 = 8 GPUs.
That is somewhat expected if you look at the available instances in the above table.
They want to avoid multiple tiny boxes and are, instead, "forcing" you through the quota to use a bigger instance.

If you use the p2.8xlarge, you will have the 32 CPUs and 8 GPUs, but your quota will most likely (I am supposing that will be the case, I might be wrong; it is something to ask their support team) be raised beyond that, I presume to somewhere between 32 and 64 (with a max of 64)... because at 64 CPUs they will "force" you again to use the p2.16xlarge.

Does it make sense, or am I talking nonsense 😉?

@yuvipanda (Member)

The ratio of CPUs to GPUs is 4 for all p2 instances, so we can only get 8 GPUs with a 32 CPU quota regardless of the size of the instances we use. In other words, we can get 32 total CPUs, which will provide us with at most 8 GPUs in whatever configuration.
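
Spelling out that arithmetic (the vCPU and GPU counts are the published p2 sizes):

```python
# Worked example: a 32-vCPU quota caps us at 8 GPUs for any mix of p2 sizes,
# because every p2 size ships 4 vCPUs per GPU.
p2_sizes = {
    "p2.xlarge": {"vcpus": 4, "gpus": 1},
    "p2.8xlarge": {"vcpus": 32, "gpus": 8},
    "p2.16xlarge": {"vcpus": 64, "gpus": 16},
}

quota_vcpus = 32
for name, spec in p2_sizes.items():
    max_instances = quota_vcpus // spec["vcpus"]
    print(f"{name}: {max_instances} instance(s), {max_instances * spec['gpus']} GPU(s)")
# p2.xlarge: 8 instance(s), 8 GPU(s)
# p2.8xlarge: 1 instance(s), 8 GPU(s)
# p2.16xlarge: 0 instance(s), 0 GPU(s)  <- can't even launch one under the quota
```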

@damianavila (Contributor, Author)

OK... but that is assuming the 32 quota is fixed across instance sizes... what I am saying here is that the quota may be fixed for the smaller instances BUT I think they might raise the quota limit if you use a bigger instance.
For instance, if you use the larger instance with 64 CPUs, a fixed quota of 32 is total nonsense! Why would you pay for an instance where you can only use half of the available GPUs?

@yuvipanda (Member)

@damianavila the quota is fixed across the instance family - it covers all P and G type instances put together. With a 32 quota, trying to launch an instance with 64 CPUs will just fail with an error message about not having enough quota.

@damianavila (Contributor, Author)

With a 32 quota, trying to launch an instance with 64 CPUs will just fail with an error message about not having enough quota.

But you agree with me that this is total nonsense, right? Or am I missing something else?

@yuvipanda (Member)

@damianavila that they gave us only a 32 CPU quota (given I had asked for 400) is definitely nonsense, and I agree! But I'm not sure I fully understand what the suggested next step is. In particular, I don't think it matters which instance sizes we try to use - so I'm a little confused there!

@damianavila (Contributor, Author) commented Jun 2, 2022

@yuvipanda, I would probably further ask AWS support IF changing to a bigger instance actually gives us the opportunity to be granted a higher quota/limit (more than 32).
To be clear, I am not suggesting we go with a bigger instance until we have explicit confirmation from AWS that "they will raise the quota to more than 32 CPUs if we run with a bigger instance".
Sorry if I am not being clear enough in my writing... and feel free to ignore this if you think it is a dead-end road.

@scottyhq (Contributor) commented Jun 2, 2022

Frustrating that the quota is so low! I'd reply with something like the following:

We've looked into alternative instance types, but our use case requires NVIDIA GPUs and Intel CPUs (https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html) for software compatibility. The NASA datasets we are working with are very large and hosted in us-west-2, so it is neither cost-effective nor efficient to launch instances in multiple regions (https://www.earthdata.nasa.gov/learn/articles/eosdis-data-cloud-user-requirements).

We are provisioning servers for hundreds of scientists for research and educational events, but with the current quota only 8 of them can have access to a GPU at a time. Also, in order to compare the performance of various instance options, we need at least enough quota to run the maximum size of P2, P3, and G4 instances, and preferably more in order to run instances simultaneously, which is why we again request a quota increase to 256 vCPUs.
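
As a back-of-the-envelope check on the 256 figure (instance specs per the EC2 instance type pages; whether this is exactly how 256 was arrived at is an assumption):

```python
# Rough check that 256 vCPUs is about what it takes to run the largest size
# in each of the three families mentioned in the request.
largest = {
    "p2.16xlarge": 64,     # 16x K80
    "p3dn.24xlarge": 96,   # 8x V100
    "g4dn.metal": 96,      # 8x T4
}
print(sum(largest.values()))  # 256 -> enough to run one of each at maximum size
```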

@yuvipanda (Member)

@damianavila I can ask them specifically, but based on my experience so far I think it's a dead end. They generally don't care how many instances you spread the CPUs across. But if your experience with AWS has been different, I can ask them!

@damianavila (Contributor, Author)

But if your experience with AWS has been different, I can ask them!

In general, I did not have a different experience than you with their support.

@yuvipanda (Member)

We got bigger GPU quota thanks to @cgentemann!

I think this issue can be closed now.

@choldgraf (Member)

Follow-up: I opened up the following issue to discuss these issues around new AWS organizations and resource limits/quotas:
