Add documentation on preparing for exams #2449
Thank you @yuvipanda for documenting this!
A lot of this isn't detailed enough for a runbook/checklist, particularly for new engineers. I've added some comments that would improve that.
docs/howto/features/exam.md
This page documents what we do to prep, based on our prior experiences.

1. Make sure the exact dates and times of the exam are checked well in
advance, and we have enough engineering coverage during this time period.
Is there a clearer definition of "enough coverage"?
docs/howto/features/exam.md
1. Make sure the exact dates and times of the exam are checked well in
advance, and we have enough engineering coverage during this time period.
Engineers should also *test* their access to the infrastructure and the
Can we be more specific here, maybe with a checklist?
- Can access and log in to the hub admin page (see https://infrastructure.2i2c.org/en/latest/reference/hubs.html for links)
- Can access and log in to the cluster Grafana
- Can access and log in to the cloud console
- Test access to Logs Explorer for container logs if on GCP
- Test that `deployer use-cluster-credentials $CLUSTER` and then `kubectl get pods -A` work
docs/howto/features/exam.md
hub beforehand, to make sure they can fix issues if needed.

2. For the duration of the exam, all user pods must have a
[guaranteed quality of service class](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/).
How does this work with node-sharing setups, or do educational hubs not have that?
Can this be set through a configuration switch?
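For reference, a minimal sketch of how Kubernetes assigns the QoS class may help when checking a user pod's resource settings. The function name `qos_class` and the input shape (a list of per-container `requests`/`limits` dicts) are assumptions made for illustration, and quantities are compared as plain strings, so `"1"` and `"1000m"` would be treated as different here even though Kubernetes considers them equal.

```python
# Only CPU and memory count toward the QoS class.
QOS_RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Return the QoS class for a pod, given its containers' resources.

    `containers` is a hypothetical simplified shape: a list of dicts like
    {"requests": {"cpu": "1"}, "limits": {"cpu": "1", "memory": "2G"}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        limits = {
            k: v for k, v in container.get("limits", {}).items()
            if k in QOS_RESOURCES
        }
        requests = {
            k: v for k, v in container.get("requests", {}).items()
            if k in QOS_RESOURCES
        }
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **requests}
        any_set = any_set or bool(requests)
        for resource in QOS_RESOURCES:
            # Guaranteed needs a limit for both resources, equal to the request.
            if resource not in limits or requests[resource] != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pod = [{"requests": {"cpu": "1", "memory": "2G"},
            "limits": {"cpu": "1", "memory": "2G"}}]
    print(qos_class(pod))
```

In practice this means every container in the user pod needs CPU and memory limits that match its requests, which is why node-sharing setups with a guarantee lower than the limit would not qualify.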
docs/howto/features/exam.md
the start of the exam. It should be reverted back soon after the exam
is done.

3. The instructor running the exam should test out their exam on the hub,
As an SRE, my action is unclear for this point. Who is responsible for facilitating this and signing it off: engineering, communities & partnerships, or the community themselves?
docs/howto/features/exam.md
versions, etc) are set up appropriately. From the time they test this until
the exam is over, new environment changes are put on hold.

4. We should pre-warm the cluster the hub is on before the start of the exam,
It is unclear what the process is from this item. Is there a link to existing docs on pre-warming? What do you mean here: overprovisioning, increasing node pool size, or some form of cache warming?
I translate the practical steps for this to calculating how many nodes are needed for the user count we guarantee can start up smoothly, and then temporarily reconfiguring a `min_nodes` setting of some kind. This can be done from terraform or in other ways, and having guidance on that could also be relevant.
For this, we should probably also have the image configured in `singleuser.image` and `prePuller.continuous.enabled` set, to ensure all started nodes have the image that will be used before users arrive on them.
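The node-count calculation described above could be sketched roughly as follows. The function name `nodes_needed`, its parameters, and the example figures are all hypothetical, and the sketch assumes memory is the only constraint on how many user pods fit on a node (CPU and per-node pod limits are ignored).

```python
import math


def nodes_needed(expected_users, node_allocatable_gb, user_guarantee_gb):
    """Estimate how many nodes to pre-warm for an exam.

    Assumes each user pod is guaranteed `user_guarantee_gb` of memory and
    that memory alone decides how many users fit on one node.
    """
    users_per_node = math.floor(node_allocatable_gb / user_guarantee_gb)
    if users_per_node < 1:
        raise ValueError("a single user pod does not fit on one node")
    # Round up: a partially filled node still has to exist.
    return math.ceil(expected_users / users_per_node)


if __name__ == "__main__":
    # Hypothetical exam: 300 students, 2 GB guaranteed each,
    # nodes with 26 GB allocatable memory.
    print(nodes_needed(300, 26, 2))
```

The result would be the value to set temporarily in whatever `min_nodes`-style setting the cluster uses, and to revert once the exam is over.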
docs/howto/features/exam.md
another.

5. Issues during the exam are communicated via freshdesk, and what we are paid
for is to make sure we respond immediately - there is no guarantee of fixes,
Is the expectation really immediate? I.e., if I am the one engineer in the Eastern timezone, am I allowed to go to the bathroom without my phone, or have lunch? Normally I'd expect support/page response levels to have clearer guidelines, such as
5 minutes to acknowledge and 15 minutes to start investigation.
If engineers are working out of hours for this, is there overtime/coverage compensation beyond salary/contracted hours?
As a quick response, the current idea is to make sure engineers are not working out of hours for this. And if we can't provide coverage for an event without requiring engineers to work out of hours, we don't accept the event. When we do, we definitely must provide some form of coverage compensation, I agree.
docs/howto/features/exam.md
5. Issues during the exam are communicated via freshdesk, and what we are paid
for is to make sure we respond immediately - there is no guarantee of fixes,
although we try very hard to make sure the infrastructure is stable during this
Are deployments frozen? Is this done in a way that, e.g., a change by an engineer not covering the exam would not cause a deploy to the cluster/hub the exam is in (e.g. adding a new Python file to deployer/)?
How is stability enforced: through configuration or process? How is this communicated?
Thanks for the detailed feedback, @pnasrat! I'll spend more time on it this week.
I don't think I'll have time to push this further right now :(
Based on our experience with 2i2c-org#1905. Trying to set this up for 2i2c-org#2316
Co-authored-by: Georgiana <[email protected]>
1d73bb0 to aec9628
@2i2c-org/engineering, I've pushed some commits that hopefully address most of the comments. The ones they do not address relate to processes we have not set yet, which I believe deserve more thought and a different "home" than the infra docs. Related to: #3522
OMG AMAZING WORK @GeorgianaElena!!!!
📚 Documentation preview 📚: https://2i2c-pilot-hubs--2449.org.readthedocs.build/en/2449/