Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][Azure] Fix azure race condition #1428

Closed
wants to merge 1 commit into from

Conversation

suquark
Copy link
Collaborator

@suquark suquark commented Nov 17, 2022

This PR fixes some race condition mentioned in #1408. However, these race conditions should be pretty rare, so I am not sure if they are the root causes (from my understanding, it is more likely due to operating the same cluster at the same time, but I cannot find related code in the test). For me I cannot reproduce any of those errors in #1408 after this fix.

Closes #1408

@suquark suquark requested a review from Michaelvll November 17, 2022 21:01
@Michaelvll
Copy link
Collaborator

Thanks for the quick fix @suquark! Did you tried tests/run_smoke_tests.sh test_azure_start_stop? The error seems happening to me pretty often with that test.

@suquark
Copy link
Collaborator Author

suquark commented Nov 17, 2022

Thanks for the quick fix @suquark! Did you tried tests/run_smoke_tests.sh test_azure_start_stop? The error seems happening to me pretty often with that test.

I run test_azure_start_stop & test_cancel_azure at the same time and cannot reproduce the error. Also do you mean you can trigger the error with just a single test instance of test_azure_start_stop?

@Michaelvll
Copy link
Collaborator

Michaelvll commented Nov 17, 2022

I run test_azure_start_stop & test_cancel_azure at the same time and cannot reproduce the error. Also do you mean you can trigger the error with just a single test instance of test_azure_start_stop?

Yea, it seems that I can trigger it with one test. Let me try the latest fix to see if that helps.

Btw, we already have the @synchronized for the _get_filtered_nodes, will the current fix cause deadlock?

@suquark
Copy link
Collaborator Author

suquark commented Nov 17, 2022

I run test_azure_start_stop & test_cancel_azure at the same time and cannot reproduce the error. Also do you mean you can trigger the error with just a single test instance of test_azure_start_stop?

Yea, it seems that I can trigger it with one test. Let me try the latest fix to see if that helps.

Btw, we already have the @synchronized for the _get_filtered_nodes, will the current fix cause deadlock?

No, because it uses RLock

@Michaelvll
Copy link
Collaborator

Michaelvll commented Nov 17, 2022

I just tried this fix, and seems the problem still exists. It seems the problem happens when the test try to sky start the cluster that was just stopped with sky stop.

@suquark suquark changed the title [Azure] Fix azure race condition [WIP][Azure] Fix azure race condition Nov 17, 2022
@suquark
Copy link
Collaborator Author

suquark commented Nov 17, 2022

Good to know. Let me keep digging.

@Michaelvll
Copy link
Collaborator

This issue has not been observed recently. Closing this for now. Please feel free to re-open it if it occurs again.

@Michaelvll Michaelvll closed this May 9, 2023
@Michaelvll Michaelvll deleted the fix_azure_race_condition branch December 18, 2024 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Azure] KeyError when launching Azure cluster
2 participants