Debugging previously working run with Flux! #429
Well, it looks like the FLUX_URI info messages are both a separate bug and a red herring: the second one should be wrapped in an else, as it gets logged regardless of whether a URI was found or not. Thanks for pointing that out! (see offending code here:
The import error looks to be the real problem here. I'm not sure how the PYTHONPATH would get changed once maestro starts running, so it might just be borked from the start? The URI gets sourced before maestro tries connecting to any broker (via the env var), so I don't think that tells us much about the state of things. Can you say more about how this is being run? Is this launching a container, building the maestro env, and then running maestro inside it, or are you trying to have maestro external to the container schedule to a broker living inside it? (If the latter, that'll be a new one for me as far as how to wire up the bindings...) Could you maybe try having a non-maestro script talk to the python bindings in place of the call to 'maestro run -fg ...' and see if that can import flux, so we can rule out an environment issue?
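Something along these lines is all I mean (untested, and assuming you run it from the same shell/environment you normally launch maestro from):

```bash
# What does the launching environment actually contain?
echo "FLUX_URI=${FLUX_URI:-<unset>}"
echo "PYTHONPATH=${PYTHONPATH:-<unset>}"

# Can this interpreter import the bindings and open a handle to the broker?
# (attr_get("size") is just a cheap round-trip to prove the connection works)
python3 -c "import flux; print('flux imported from', flux.__file__); print('broker size:', flux.Flux().attr_get('size'))"
```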
Sure! So I have this container: https://github.com/rse-ops/hpc-apps/blob/main/maestro-lulesh/Dockerfile and we are using the Flux Operator with a new design where we stay in that container (and don't need flux there). Flux is added as an on-demand view, and then we run the maestro command: maestro run -fg ./lulesh-flux.yaml -y. I confirmed that PYTHONPATH had both the path for maestro (which is in dist-packages) and the path for flux; same for PATH. So to answer your question, we are just running a container with maestro, with flux added on the fly, but the flux python paths are different from where maestro is installed (and that might be the issue). I could try installing maestro into the same flux python environment if that might help debug?
Hmmm. I don't know that you need full flux itself installed in the same env as maestro, but rather just the bindings. Can you add an extra bit in that container to also pip install the python bindings?
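Something like this in the container build (the flux-python version pin is just a guess; match it to the flux in your view):

```bash
# Install only the flux python bindings, not all of flux itself.
pip install flux-python==0.51.0
```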
Yep will try that and report back. |
okay that won't work because we don't have flux-security in these containers. Long story, but it was easier to not have the whole munge thing.

```
Collecting flux-python==0.51.0
  Using cached flux-python-0.51.0.tar.gz (217 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      Cannot find flux security under expected path /mnt/flux/view/include/flux/security
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
```
I can look into whether we can ease that requirement.
Ok, how about this more immediate bandaid:

- Get one of the required dependencies for the bindings.
- Wire up the python path automatically using awk and flux's exported environment vars (rough sketch below).

That was the old solution before the lovely python bindings were available separately.

EDIT: remove obnoxious heading formatting
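Very much a sketch (I don't remember offhand exactly what `flux env` prints on current versions, so treat the scraping pattern as a placeholder):

```bash
# The idea: ask flux (already on PATH from the view) where its python bindings
# live, then prepend that to PYTHONPATH for the environment maestro runs in.
#
# Scraping variant: check what `flux env` actually prints and adjust the pattern:
#   export PYTHONPATH="$(flux env | awk -F'"' '/PYTHONPATH/ {print $2; exit}'):${PYTHONPATH}"
#
# Or skip the scraping and ask the flux-provided python where the module lives:
export PYTHONPATH="$(flux python -c 'import flux, os; print(os.path.dirname(os.path.dirname(flux.__file__)))'):${PYTHONPATH}"
```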
haha thank you! I should have a tweaked pip install version pushed shortly, so I want to try that first.
sweet!
Hmm... maybe try turning debug logging on: Additionally, what about trying a serial spec to see if there are mpi-related issues in here somehow: https://github.com/LLNL/maestrowf/blob/develop/samples/hello_world/hello_bye_parameterized_flux.yaml ?
Well, that's not really a readable handle address, is it... As for why it's choking, can you write up a standalone script using the same from nest command and see if we can even submit the job script manually? I'm not entirely familiar with that part of the stack trace inside flux, so I'll need to go dig around in there to see what might trigger it (maybe a malformed job script/spec?), as maestro's exceptions don't really say much about what's going on.
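Something bare-bones like this is all I'm after: straight through the bindings with no maestro in the loop. The command and resources are placeholders (swap in the real lulesh step), and it assumes FLUX_URI/PYTHONPATH are set the same way they are for the maestro run:

```bash
python3 - <<'EOF'
import os
import flux
from flux.job import JobspecV1, submit

handle = flux.Flux()  # connect to the broker named by FLUX_URI

# Placeholder command/resources; swap in the real lulesh step from the spec.
jobspec = JobspecV1.from_command(
    ["hostname"], num_tasks=1, num_nodes=1, cores_per_task=1
)
jobspec.cwd = os.getcwd()
jobspec.environment = dict(os.environ)

print("submitted:", submit(handle, jobspec))
EOF
```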
I think I'm probably going to drop working on this for now, at least until someone else requests that the example work again, because it's one workflow of (potentially) many that would be nice to have working as it did before, but it shouldn't block working on other things. Let's leave the issue open in case anyone else has insights. Thanks for the help @jwhite242!
Hi folks! I rebuilt a container with a newer maestrowf, and I'm having trouble reproducing a previously working run. I think it might be related to the PYTHONPATH and FLUX_URI - it appears that we first find both, but then when maestro runs it doesn't seem to be able to import flux (suggesting the PYTHONPATH was altered). Here are the full logs, and I'll try to annotate them a bit.
Can we talk about what the steps / flow of logic are between that first FLUX_URI being found and the second? If the second isn't finding flux because the PYTHONPATH isn't being passed forward, that might be the bug?
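If it helps, I can re-run and capture things like this so the two FLUX_URI messages can be compared side by side (the spec path is from my setup above):

```bash
# Snapshot what the launching shell sees, then keep the full maestro log around.
env | grep -E 'FLUX_URI|PYTHONPATH'
maestro run -fg ./lulesh-flux.yaml -y 2>&1 | tee maestro-run.log
grep -iE 'flux_uri|import' maestro-run.log
```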
For context, here is the workflow I'm running: