[not for merge] benc poking at CI hangs in flux test #3259

Closed
wants to merge 66 commits into from

Conversation

benclifford
Collaborator

No description provided.

mercybassey and others added 30 commits March 6, 2024 09:14
@benclifford
Collaborator Author

@mercybassey looks like I managed to make the hang reproducible both in CI and running the equivalent commands in docker on my laptop - this branch is a fork of your flux testing PR.

@benclifford
Collaborator Author

@jameshcorbett @vsoch hi, not sure if you're interested in digging into this at the moment, but this branch runs flux tests based on @mercybassey's PR #3159, which was hanging occasionally. I fixed the seed of the test order randomisation and added a bit of logging, and it looks like it now hangs every time in parsl/tests/test_flux.py::test_affinity. I also ran this same flux container on my laptop and get the same hang every time there too. So there's something reproducible here.

I put in some log statements to try to trace what's happening, and it seems to hang non-deterministically somewhere around the point where flux.job.executor.FluxExecutor.__init__ is executed. The last log message is usually, but not always, around these lines:

        self._shutdown_lock = threading.Lock()
        self._broken_event = threading.Event()
        self._shutdown_event = threading.Event()

So I think this is something weird happening inside Flux proper rather than in the parsl.executors.flux executor, but I don't have enough of a feel for what's really meant to be happening here to give a decent diagnosis...

If you're interested, I think you should be able to recreate the hang using the commands in the parsl+flux.yaml github actions workflow that this PR adds - it's what I did on my laptop. Let me know if there's anything I can do to get more useful information.
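
For anyone trying to narrow down where a hang like this sits, one generic option is Python's standard faulthandler module, which can dump every thread's stack if a call has not returned after a timeout. The following is a minimal sketch and not something this branch does: the helper name run_with_stack_dump and the timeout value are assumptions made for illustration, and the wrapped function would be whatever call is suspected of hanging.

```python
import faulthandler
import sys


def run_with_stack_dump(fn, timeout_s=30.0, file=None):
    """Run fn(); if it has not returned after timeout_s seconds,
    write the stacks of all threads to `file` (default: sys.stderr).

    Hypothetical helper for illustration only. `file` must be a real
    file object with a file descriptor, because faulthandler writes
    to the descriptor directly.
    """
    if file is None:
        file = sys.stderr
    # Arm a watchdog: it dumps tracebacks but does not kill the process.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

The dump shows which line each thread is blocked on, which may be more precise than reading the last log message before the hang.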

@vsoch
Contributor

vsoch commented Mar 18, 2024

My guess would be that, with that function, you don't have any cores you are allowed to run on, so it doesn't run (and hangs). https://stackoverflow.com/questions/64189176/os-sched-getaffinity0-vs-os-cpu-count

At least on Linux I found this to mean that if none of the allowed cores is currently available, the thread of a child-process won't run, even if other, non-allowed cores would be idle. So "affinity" is a bit misleading here.

There might be more information in that link. I'm not really sure what you are testing there, but submitting a job that asks for a value (the affinity) that will hang or never return when no cores are available, in a test environment with a bunch of other stuff running (and using up the threads), smells funny.
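
As a quick way to see the difference the linked question is about, the sketch below prints both values for the current process. The function name usable_cpu_report is made up for illustration; os.sched_getaffinity exists only on some Unix platforms, so the sketch checks for it first.

```python
import os


def usable_cpu_report():
    """Compare the machine's CPU count with the CPUs this process may use.

    Hypothetical helper for illustration only.
    """
    total = os.cpu_count()  # CPUs in the system; may be None if unknown
    if hasattr(os, "sched_getaffinity"):
        # 0 means "the calling process"; the result is a set of CPU ids.
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = None  # platform does not expose CPU affinity
    return {"cpu_count": total, "allowed_cpus": allowed}


if __name__ == "__main__":
    print(usable_cpu_report())
```

Inside a container or a job with restricted resources, allowed_cpus can be shorter than cpu_count suggests.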

@benclifford
Collaborator Author

That test_affinity test does pass in some situations (it was hard for us to get the hang to happen in CI until I realised it seems to be test order dependent), and if I run just that test on its own, it seems to pass just fine.

The following is with test_affinity running at the end of the test sequence, using --random-order-seed=893320, which is how I got the order that often hangs for me:

If I replace sched_getaffinity with e.g. eval, I still get hangs:

        future = executor.submit(eval, {"cores_per_task": 2}, "[1,2,3]")

        # future = executor.submit(os.sched_getaffinity, {"cores_per_task": 2}, 0)

If I drop cores_per_task to 1, I still get hangs.

If I remove the core resource specification entirely, I still get hangs:

        future = executor.submit(eval, {}, "[1,2,3]")
        # future = executor.submit(eval, {"cores_per_task": 1}, "[1,2,3]")
        # future = executor.submit(os.sched_getaffinity, {"cores_per_task": 2}, 0)      

... and this very special horror: if I run touch parsl/tests/test_flux.py, then the first time I run:

flux start pytest parsl/tests/test_flux.py --config local --random-order --random-order-bucket=module --random-order-seed=893320 --full-trace --log-cli-level=DEBUG

the whole set of tests passes. The tests then do not pass if I re-run them inside the same container, until I touch test_flux.py again. I'm not really clear what effect changing the time metadata on that test_flux.py file would have, though...
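
One thing that touching a .py file does change is whether Python's cached bytecode for it is still considered valid: by default a .pyc records the source file's modification time and size, and a mismatch forces a recompile. As far as I know, pytest's cached assertion-rewritten test modules are validated in a similar way. Whether that has anything to do with the hang is not established here. The sketch below only demonstrates the timestamp check itself, on a throwaway module in a temporary directory; the function names are made up for illustration.

```python
import os
import py_compile
import struct
import tempfile


def pyc_recorded_mtime(pyc_path):
    """Return the source mtime stored in a timestamp-based .pyc header.

    Header layout (PEP 552): magic (4 bytes), flags (4 bytes), then for
    timestamp-based files the source mtime (4 bytes) and size (4 bytes).
    """
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    flags, mtime, _size = struct.unpack("<III", header[4:16])
    assert flags == 0, "not a timestamp-based pyc"
    return mtime


def touch_invalidates_pyc():
    """Return (matches_before_touch, matches_after_touch)."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "example_mod.py")
        with open(src, "w") as f:
            f.write("x = 1\n")
        pyc = py_compile.compile(
            src,
            cfile=os.path.join(d, "example_mod.pyc"),
            invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
        )

        def source_mtime():
            return int(os.stat(src).st_mtime) & 0xFFFFFFFF

        before = pyc_recorded_mtime(pyc) == source_mtime()
        # Simulate `touch`: move the source mtime forward by ten seconds.
        st = os.stat(src)
        os.utime(src, (st.st_atime, st.st_mtime + 10))
        after = pyc_recorded_mtime(pyc) == source_mtime()
        return before, after
```

If the cached bytecode were involved, deleting the relevant __pycache__ directory before a run would be expected to have the same effect as touch, which might be a cheap way to test the idea.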

So I don't think this is directly anything to do with that os.sched_getaffinity call, and more to do with ... something else?

This test_flux.py came as part of the PR #2051 contribution of the parsl/flux executor from @jameshcorbett and I think I've never tried running it before the last week or so - so I don't have any feel for what could go wrong here.

@vsoch
Contributor

vsoch commented Mar 19, 2024

Sorry can’t add more insight here - I don’t really understand this test.

@benclifford
Collaborator Author

This resulted in PR #3517, so this PR (#3259) is no longer needed.
