fix: deadlock issue #9728
Conversation
✅ Deploy Preview for determined-ui canceled.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #9728 +/- ##
==========================================
- Coverage 53.44% 53.44% -0.01%
==========================================
Files 1257 1257
Lines 154350 154359 +9
Branches 3298 3298
==========================================
Hits 82500 82500
- Misses 71700 71709 +9
Partials 150 150
Flags with carried forward coverage won't be shown.
Were you able to reproduce the issue?
This PR seems to resolve the issue. I tested it by helm-installing Determined with this commit SHA onto a remote cluster and running the workflow that produces the deadlock when a SHA from main is used with the helm install instead.
Fantastic, thank you! What is the workflow that reproduces the issue?
If you're running a remote GKE cluster: create a workspace bound to an auto-created namespace and start a task (I used a JL instance) in that workspace. Then delete the master deployment pod and wait for the k8s API server to recreate it. Once the pod is back up and running, open the "Edit Workspace" modal in the WebUI and save the workspace as is. You'll see the loading sign, and then nothing will load after that 🙈😂 I imagine that a CLI command such as
LGTM!
I just have a question about when we set j.namespacesWithInformers[namespace] = true (line 260).
LGTM! Nice work.
# function that only gets called when a job is running in the namespace that we want to bind to the
# workspace. Verify that we don't run into this deadlock when triggering multiple calls to the
# API handler that binds a workspace to an auto-created namespace, as this request can trigger the
# deadlock if the namespace we bind to (or verify the existence of for the first time) is running a job.
I love this explanation! Great comment and great test.
@@ -35,9 +35,3 @@ stages:
kubeconfig_path: ~/.kube/config
determined_master_ip: $DOCKER_LOCALHOST
determined_master_port: 8080
internal_task_gateway:
Out of curiosity, why is this deleted? Was it a bit of cleanup discovered during testing?
Yeaa, I think we only need a gateway for multiRM, to make tasks that create resources in the remote cluster (the cluster tied to the additional RM) accessible to the Determined master, so single-RM clusters don't need that config!
(cherry picked from commit 9dc0afa)
Ticket
DET-10441
Description
Fix a deadlocking issue with creating, or verifying the existence of, a Kubernetes namespace.
Test Plan
Checklist
docs/release-notes/
See Release Note for details.