[not for merge] benc poking at CI hangs in flux test #3259
Conversation
@mercybassey it looks like I managed to make the hang reproducible, both in CI and when running the equivalent commands in docker on my laptop - this branch is a fork of your flux testing PR.
@jameshcorbett @vsoch hi, not sure if you're interested in digging into this at the moment, but this branch runs flux tests based on @mercybassey's PR #3159, which was hanging occasionally. I fixed the seed of the test-order randomisation and added a bit of logging, and it now looks like it hangs every time in parsl/tests/test_flux.py::test_affinity. I also ran this same flux container on my laptop and get the same hang every time there too, so there's something reproducible here. I put some log statements in to try to trace what's happening, and it seems to hang non-deterministically somewhere around the point that ...
So I think this is something weird happening inside Flux proper rather than in the parsl.executors.flux flux executor, but I don't have enough of a feel for what's really meant to be happening here to give a decent diagnosis. If you're interested, I think you should be able to recreate the hang using the commands in the parsl+flux.yaml GitHub Actions workflow that this PR adds - it's what I did on my laptop. Let me know if there's anything I can do to get more useful information.
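In case it helps, this is roughly how I've been kicking the tests off inside the container. Treat it as a sketch only: the test selection and any extra parsl options here are my guesses, and the workflow file in this branch is the authoritative version (the seed is the one I mention further down).

```python
# Illustrative reproduction sketch, not a copy of the workflow. Assumes the
# pytest-random-order plugin (which provides --random-order-seed) is installed;
# adjust the test selection and any parsl-specific options to match
# parsl+flux.yaml in this branch.
import sys
import pytest

sys.exit(pytest.main([
    "parsl/tests/",                  # assumed selection - the workflow may narrow this
    "--random-order-seed=893320",    # the seed that gives me the hanging order
]))
```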
My guess would be that, using that function, you don't have any cores you are allowed to run on, so the job doesn't run (and hangs). https://stackoverflow.com/questions/64189176/os-sched-getaffinity0-vs-os-cpu-count
There might be more information in that issue. I'm not really sure what you are testing there, but submitting a job that asks for a value (the affinity) that will hang / not return when there are none available, in a test environment with a bunch of other stuff running (and using up the threads), smells funny.
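To make that distinction concrete, a quick generic-Python illustration (nothing parsl- or flux-specific):

```python
import os

# Total CPUs the machine reports, regardless of what this process may use.
print("cpu_count:", os.cpu_count())

# The set of CPU ids the calling process is actually allowed to run on
# (Linux-only). In a constrained CI container this can be a much smaller
# set, so a job that insists on more cores than this may never be scheduled.
print("affinity:", os.sched_getaffinity(0))
```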
That test_affinity test does pass in some situations (it was hard for us to get it to happen in CI until I realised it seems to be test-order dependent), and especially if I run just that test on its own it seems to pass just fine. The hang shows up with test_affinity running at the end of the test sequence, using --random-order-seed=893320, which is how I got the ordering that hangs for me most often. If I replace sched_getaffinity with e.g. ...
If I drop cores_per_task to 1, I still get hangs. If I remove the core resource specification entirely, I still get hangs.
... and this very special horror: if I touch test_flux.py, the whole set of tests passes the first time, and then does not pass if I re-run inside the same container, until I touch test_flux.py again. I'm not really clear what effect changing the time metadata on that test_flux.py file would have, though... So I don't think this is directly anything to do with that os.sched_getaffinity call, and more to do with ... something else? This test_flux.py came as part of the PR #2051 contribution of the parsl/flux executor from @jameshcorbett, and I think I've never tried running it before the last week or so - so I don't have any feel for what could go wrong here.
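For anyone reading along without the file open, my rough understanding of the shape of what test_affinity does is sketched below. This is from memory and not the actual test: the direct construction and start()/shutdown() of the executor and the exact resource numbers are my assumptions (the real test drives the executor through a pytest fixture), and it needs a working Flux installation such as the one in the container this PR uses.

```python
# Hedged sketch of roughly what the test exercises - not the real test_flux.py.
import os
import concurrent.futures

from parsl.executors import FluxExecutor

executor = FluxExecutor()
executor.start()  # assumption: the real test sets the executor up via a fixture

try:
    # Ask Flux for cores and have the task report the CPUs it may run on.
    # "cores_per_task" is the key discussed above; the value 2 is an
    # assumption, and dropping it to 1 (or passing {}) still hangs for me.
    future = executor.submit(os.sched_getaffinity, {"cores_per_task": 2}, 0)

    # When the problem reproduces, this wait is where everything stops.
    print(future.result(timeout=120))
except concurrent.futures.TimeoutError:
    print("timed out waiting for the affinity task - this looks like the hang")
finally:
    executor.shutdown()
```

If that roughly matches what the real test does, then the wait inside future.result() being where things stop would fit with the problem sitting on the Flux side of the handoff rather than in the parsl executor code.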
Sorry can’t add more insight here - I don’t really understand this test.