Add documentation on preparing for exams #2449

Merged
merged 6 commits into from
Dec 13, 2023

Conversation

yuvipanda
Member

@yuvipanda yuvipanda commented Mar 31, 2023

Based on our experience with #1905.

Trying to set this up for #2316


📚 Documentation preview 📚: https://2i2c-pilot-hubs--2449.org.readthedocs.build/en/2449/

@yuvipanda yuvipanda mentioned this pull request Mar 31, 2023
9 tasks
@yuvipanda yuvipanda requested a review from a team March 31, 2023 11:20
Member

@GeorgianaElena GeorgianaElena left a comment

Thank you @yuvipanda for documenting this!

docs/howto/features/exam.md (outdated, resolved)
Contributor

@pnasrat pnasrat left a comment

A lot of this isn't detailed enough for a runbook or checklist, particularly for new engineers. I've added some comments that would improve that.

This page documents what we do to prep, based on our prior experiences.

1. Make sure the exact dates and times of the exam are checked well in
advance, and we have enough engineering coverage during this time period.
Contributor

Is there a clearer definition of "enough coverage"?


1. Make sure the exact dates and times of the exam are checked well in
advance, and we have enough engineering coverage during this time period.
Engineers should also *test* their access to the infrastructure and the
Contributor

Can we be more specific here, maybe with a checklist?

  • Can access and log in to the hub admin page (see https://infrastructure.2i2c.org/en/latest/reference/hubs.html for links)
  • Can access and log in to the cluster grafana
  • Can access and log in to the cloud console
  • Test access to Logs Explorer for container logs if on GCP
  • Test that `deployer use-cluster-credentials $CLUSTER` and then `kubectl get pods -A` work

hub beforehand, to make sure they can fix issues if needed.

2. For the duration of the exam, all user pods must have a
[guaranteed quality of service class](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/).
Contributor

How does this work with node-sharing setups, or do educational hubs not have that?

Can this be set through a configuration switch?
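
To make the "guaranteed" requirement concrete, here is a minimal sketch of the rule as I understand it from the Kubernetes documentation linked above: a pod lands in the Guaranteed class only when every container sets CPU and memory limits and its requests equal those limits. The function name `is_guaranteed` and the plain-dict pod spec are hypothetical, chosen for illustration; this is not how the hub configuration or the deployer checks it. Quantities are compared as literal strings, so `"1"` and `"1000m"` would be treated as different, which is a simplification.

```python
def is_guaranteed(pod_spec: dict) -> bool:
    """Sketch: True if every container has cpu/memory requests equal to limits.

    `pod_spec` is assumed to be the `spec` section of a pod manifest loaded
    as a plain dict. Quantities are compared as strings (a simplification).
    """
    # Init containers are included here as well, which is the stricter reading.
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    if not containers:
        return False
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for key in ("cpu", "memory"):
            limit = limits.get(key)
            if limit is None:
                # No limit set for this resource: cannot be Guaranteed.
                return False
            # Kubernetes fills in a missing request from the limit,
            # so only an explicit, different request breaks the rule.
            if requests.get(key, limit) != limit:
                return False
    return True


if __name__ == "__main__":
    example = {
        "containers": [
            {
                "name": "notebook",
                "resources": {
                    "requests": {"cpu": "1", "memory": "2Gi"},
                    "limits": {"cpu": "1", "memory": "2Gi"},
                },
            }
        ]
    }
    print(is_guaranteed(example))
```

In a Zero to JupyterHub style configuration this would presumably mean setting the single-user CPU and memory guarantees equal to their limits for the exam window, but that mapping is an assumption on my part and not something this PR states.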

the start of the exam. It should be reverted back soon after the exam
is done.

3. The instructor running the exam should test out their exam on the hub,
Contributor

As an SRE, my action on this point is unclear. Who is responsible for facilitating this and signing it off: engineering, communities & partnerships, or the community themselves?

versions, etc) are set up appropriately. From the time they test this until
the exam is over, new environment changes are put on hold.

4. We should pre-warm the cluster the hub is on before the start of the exam,
Contributor

It is unclear from this item what the process is. Is there a link to existing docs on pre-warming? What do you mean here: overprovisioning, increasing the node pool size, or some form of cache warming?

Contributor

In practical terms, I read this as calculating how many nodes are needed for the number of users we guarantee can start up smoothly, and then temporarily reconfiguring a min_nodes setting of some kind. This can be done from terraform or in other ways, and having guidance on that could also be relevant.

For this, we should probably also have the image configured in `singleuser.image` and enable `prePuller.continuous.enabled`, so that every started node already has the image that will be used before users arrive on it.
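
The node calculation described above can be sketched as simple arithmetic. This is only an illustration under stated assumptions: user pods are packed by memory guarantee alone, every node is the same size, and a fixed slice of each node is reserved for system pods. The function name `nodes_needed` and all numbers in the example are hypothetical and not taken from any 2i2c cluster.

```python
import math


def nodes_needed(
    expected_users: int,
    node_allocatable_gib: float,
    user_mem_guarantee_gib: float,
    overhead_gib: float = 0.0,
) -> int:
    """Sketch: how many equal-sized nodes fit `expected_users` by memory guarantee.

    All parameters are illustrative assumptions:
    - node_allocatable_gib: memory a node can hand out to pods
    - user_mem_guarantee_gib: memory guaranteed to each user pod
    - overhead_gib: memory set aside per node for system and support pods
    """
    usable = node_allocatable_gib - overhead_gib
    users_per_node = math.floor(usable / user_mem_guarantee_gib)
    if users_per_node < 1:
        raise ValueError("a single user pod does not fit on one node")
    # Round up: a partially filled node still has to exist before the exam starts.
    return math.ceil(expected_users / users_per_node)


if __name__ == "__main__":
    # Hypothetical exam: 200 students, 2 GiB each, 52 GiB nodes with 2 GiB reserved.
    print(nodes_needed(200, 52, 2, overhead_gib=2))
```

The result would be the value to set temporarily as the minimum node count for the user node pool before the exam, and to revert afterwards. CPU guarantees or per-node pod limits may be the tighter constraint in practice, so the memory-only packing here is a simplification.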

another.

5. Issues during the exam are communicated via freshdesk, and what we are paid
for is to make sure we respond immediately - there is no guarantee of fixes,
Contributor

Is the expectation really immediate? I.e., if I am the one engineer in an Eastern timezone, am I allowed to go to the bathroom without my phone, or have lunch? Normally I'd expect support/page response levels to have clearer guidelines, for example:

5 minutes to acknowledge, 15 minutes to start investigation.

If engineers are working out of hours for this, is there overtime/coverage compensation beyond salary/contracted hours?

Member Author

As a quick response, the current idea is to make sure engineers are not working out of hours for this. And if we can't provide coverage for an event without requiring engineers to work out of hours, we don't accept the event. When we do, we definitely must provide some form of coverage compensation, I agree.


5. Issues during the exam are communicated via freshdesk, and what we are paid
for is to make sure we respond immediately - there is no guarantee of fixes,
although we try very hard to make sure the infrastructure is stable during this
Contributor

Are deployments frozen? Is this done in a way that, e.g., a change by an engineer not covering the exam would not cause a deploy to the cluster/hub the exam is on (e.g. adding a new Python file to deployer/)?

How is stability enforced, through configuration or through process? How is this communicated?

@yuvipanda
Member Author

Thanks for the detailed feedback, @pnasrat! I'll spend more time on it this week.

@yuvipanda
Member Author

I don't think I'll have time to push this further right now :(

@GeorgianaElena
Member

@2i2c-org/engineering, I've pushed some commits that hopefully address most of the comments. The ones they do not address relate to processes we have not set yet, which I believe deserve more thought and a different "home" than the infra docs.

Related to: #3522

@yuvipanda
Member Author

OMG AMAZING WORK @GeorgianaElena!!!!

@yuvipanda yuvipanda merged commit d335408 into 2i2c-org:master Dec 13, 2023
1 check passed
Status: Done 🎉
5 participants