
[INCREASED HUB ACTIVITY] 2i2c climatematch #2753

Closed
GeorgianaElena opened this issue Jul 3, 2023 · 17 comments
@GeorgianaElena (Member)

Summary

The Climatematch community is expecting increased hub activity starting Wednesday this week and continuing until August 1st.

Event Info

There are multiple triggers for the increased hub activity as specified in the https://2i2c.freshdesk.com/a/tickets/804 ticket:

  • [additional nodes requested]: July 5-6 (Wednesday-Thursday this week) they are holding an instructor training event where they expect about 50 logins, intermittently throughout the day.
  • [no additional nodes requested]: Next week they will be adding over 500 students to the course GitHub team and asking them to test access to the hub.
  • [no additional nodes requested]: The course officially begins on July 17. Between students, instructors, and content creators, they expect about 700 active users during the two weeks of the live course.
  • Activity will likely drop after August 1, but we will monitor usage after that, with a potential renewal until the end of December to allow students to continue working on their projects.

Hub info

@pnasrat (Contributor) commented Jul 3, 2023

I'll handle the increase in nodes for Wednesday.

@GeorgianaElena (Member, Author)

Thank you @pnasrat!

@pnasrat pnasrat changed the title [INCREASED HUB ACTIVITY] 2i2c-climatech [INCREASED HUB ACTIVITY] 2i2c climatematch Jul 4, 2023
@pnasrat (Contributor) commented Jul 4, 2023

Looking at the climatematch configuration: kubespawner allocates notebook servers with a limit of 7 GB and a guarantee of 5 GB onto a pool of n1-highmem-2 nodes (2 vCPUs, 13 GB RAM), so we can only fit 2 users per node at the guarantee, and if both burst memory it will cause a reschedule.

That seems low. I'm wondering why this isn't larger nodes, more densely packed, with some overhead capacity.

700 active users at peak would need 700 nodes, which isn't ideal in my mind.

A user profile isn't suitable for this setup as it's a class. CC @consideRatio for thoughts on sizing.
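The packing math above can be sketched quickly. This is a rough sketch, not the actual scheduler behavior: the ~1.5 GB per-node system overhead is an assumed figure, and real allocatable memory depends on the cluster's kubelet reservations.

```python
# Rough packing math for the current pool, using the figures quoted above:
# n1-highmem-2 nodes (13 GB RAM) and notebook pods with a 5 GB memory
# guarantee (request) and 7 GB limit.
NODE_RAM_GB = 13
SYSTEM_OVERHEAD_GB = 1.5  # kubelet, system daemonsets, etc. (assumed)
GUARANTEE_GB = 5
LIMIT_GB = 7

allocatable_gb = NODE_RAM_GB - SYSTEM_OVERHEAD_GB
users_per_node = int(allocatable_gb // GUARANTEE_GB)
print(users_per_node)  # 2 users fit at the guarantee

# If both users burst to their 7 GB limit, the node is oversubscribed:
print(users_per_node * LIMIT_GB > allocatable_gb)  # True -> eviction risk
```

With only 2 users per node at the guarantee, and oversubscription as soon as both burst, the pool packs users very sparsely.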

@pnasrat (Contributor) commented Jul 4, 2023

Created initial pull request #2757 to handle the instructor case, but we should still resolve the plan for actual node sizing for next week and the event itself.

@consideRatio (Contributor)

> That seems low. I'm wondering why this isn't larger nodes more densely packed with some overhead capacity

Legacy reasons, I presume. When there are 100+ users, having at least 10 users per node is essential to avoid running out of quota for node disks or public IPs, and to keep startup times down, I'd say.

> we should resolve the plan for actual node sizing for next week and the event itself.

I'm approving #2757, but like you I think there is a lot of room for improvement. I'd say we should go for n2-highmem-16 machines with 128 GB RAM if the requests/limits are 5/7 GB.

@damianavila damianavila moved this from Todo 👍 to In Progress ⚡ in Sprint Board Jul 5, 2023
@damianavila damianavila moved this from Needs Shaping / Refinement to In progress in DEPRECATED Engineering and Product Backlog Jul 5, 2023
@yuvipanda (Member)

@pnasrat @consideRatio I've documented that here now: #2765. I agree we should make the node bigger now.

@pnasrat (Contributor) commented Jul 7, 2023

Note they have also requested notebook memory to be 16 GB.

@yuvipanda, you mentioned quota issues for using n2 machines yesterday; do you have more specific information on that?

128/16, minus some overhead, means about 7 notebooks per node, so for 700 concurrent users we'd need 100 nodes for a synchronous workshop.

It's unclear if 700 active users means sustained load or more intermittent usage by 700 unique users.
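The back-of-the-envelope sizing above works out as follows. This is a sketch using the figures in this comment; the per-node overhead value is an assumption standing in for "minus some overhead".

```python
import math

# Sizing sketch for the figures discussed above: 128 GB nodes, a requested
# 16 GB memory guarantee per notebook, and a peak of 700 concurrent users.
NODE_RAM_GB = 128
OVERHEAD_GB = 16       # assumed reservation for system components
PER_USER_GB = 16       # requested notebook memory
PEAK_USERS = 700

users_per_node = int((NODE_RAM_GB - OVERHEAD_GB) // PER_USER_GB)
nodes_needed = math.ceil(PEAK_USERS / users_per_node)
print(users_per_node, nodes_needed)  # 7 users per node, 100 nodes at peak
```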

pnasrat added a commit that referenced this issue Jul 7, 2023
@pnasrat (Contributor) commented Jul 7, 2023

I've created #2775 with n1-highmem-32 (which would support 10 user servers per node); happy for feedback on the machine type. I've also communicated with the community via support that we'll be changing the pool. Once that's done, I'll increase the memory request to 16 GB as requested.

@consideRatio (Contributor)

I requested a quota increase for n2 machines. If that arrives, I suggest using n2-highmem over n1, because they are faster and cheaper per unit of CPU performance and per unit of RAM, especially when considering RAM. n1-highmem has a 1:6.5 CPU:RAM ratio, while n2-highmem has 1:8.

A less impactful idea, in my mind, is to have affinity towards a subset of the available n2-highmem machines, specifically 4 / 16 / 64 CPUs, just to reduce some complexity and simplify our offerings, not because it's critical.
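The ratio comparison can be made concrete. A sketch using only the ratios quoted above (GB of RAM per vCPU for each highmem family):

```python
# GB of RAM per vCPU, as quoted above for the GCP highmem machine families.
N1_HIGHMEM_GB_PER_VCPU = 6.5
N2_HIGHMEM_GB_PER_VCPU = 8.0

# For memory-bound notebook workloads, a higher RAM:CPU ratio means fewer
# vCPUs are provisioned (and paid for) per GB of RAM.
vcpus_n1 = 128 / N1_HIGHMEM_GB_PER_VCPU
vcpus_n2 = 128 / N2_HIGHMEM_GB_PER_VCPU
print(round(vcpus_n1, 1), round(vcpus_n2, 1))  # 19.7 16.0
```

So for the same 128 GB of RAM, the n1 family carries roughly 23% more vCPUs than n2, which is wasted capacity when memory is the binding constraint.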

pnasrat added a commit that referenced this issue Jul 7, 2023
@consideRatio (Contributor)

Approved! Apparently the same minute it was received.

[screenshot: quota increase approval]

For reference about quotas: the reason we had to request an increase is that this project is relatively old. Newer projects are initialized with a non-zero quota. For example, a new project I created had a quota of 300 for both n1 and n2 machines in a few regions, and a 600 n1 quota in a few as well.

@pnasrat (Contributor) commented Jul 11, 2023

Looks like they've had fairly constant usage over the last 24 hours, not spiking at 500 concurrent users but more like 10, so the single larger node is handling that with room to grow.

[screenshot: hub usage over the last 24 hours]

@pnasrat (Contributor) commented Jul 11, 2023

Recommendation: early morning America/New_York on 2023-07-17, grow the minimum pool size to 10. I'll be on support, so I can take that.

@pnasrat (Contributor) commented Jul 19, 2023

@damianavila, assigning to you for potential end-of-event engineering work after August 1.

@damianavila damianavila moved this from In Progress ⚡ to Waiting 🕛 in Sprint Board Jul 25, 2023
@damianavila damianavila removed their assignment Jul 27, 2023
@damianavila (Contributor)

Given that we are past August 1st, we should probably go back to the previous state, although it is not clear to me what that previous state would look like. It seems we had these PRs related to the increased activity:

It would be nice to get some agreement from @2i2c-org/engineering about the state we want to achieve before @GeorgianaElena works on it (btw, Georgiana, if you are opinionated about this one, feel free to move forward when you have some time).

@GeorgianaElena (Member, Author) commented Aug 4, 2023

@damianavila, I will leave the min node pool size at zero (it is already zero, so no nodes are pre-created) to reduce costs, but will keep the same memory guarantees in case any users expect to be able to re-run the same workflows.

Also, there still seem to be some users left:

[screenshot: active users, 2023-08-04]

Also, we might want to double-check the decommission date with them, which seems to be in two weeks per https://github.com/2i2c-org/leads/issues/110

@GeorgianaElena (Member, Author) commented Aug 4, 2023

Ok, so I've been going through the tickets and PRs related to this event, and my opinion about the next steps is that we should:

  • do nothing with the current state of the infra for climatematch, given that:
    - the min nodepool size is zero, which is ok
    - they have bigger machines for the nodes, which fit about 10 users, but because they are still actively using the hub (see the graph above) I believe that's ok
  • have partnerships check back with them about:
    • a renewal of their contract, which ends in two weeks
    • their expected resource usage outside of this event if they decide to continue using the hub (how many concurrent users, how much memory)

Note that there's also an ongoing support discussion about enabling other features on this hub.

I will close this issue now and ping partnerships about this lead in https://github.com/2i2c-org/leads/issues/110.

Anyone on @2i2c-org/engineering, please feel free to re-open if you don't agree with the conclusion about keeping the current state of the infra for this hub.

@github-project-automation github-project-automation bot moved this from Waiting 🕛 to Done 🎉 in Sprint Board Aug 4, 2023
@damianavila (Contributor)

Awesome, @GeorgianaElena, I concur with your opinions and actions so far. Thanks!!
