Debugging previously working run with Flux! #429

Open · vsoch opened this issue Nov 14, 2023 · 16 comments

@vsoch commented Nov 14, 2023

Hi folks! I rebuilt a container with a newer maestrowf, and I'm having trouble reproducing a previously working run. I think it might be related to the PYTHONPATH and FLUX_URI: it appears that we first find both, but then when maestro runs it doesn't seem to be able to import flux (suggesting the PYTHONPATH was altered). Here are the full logs, and I'll try to annotate them a bit.

# This is starting flux with 'flux start', which is what we call launcher mode: the workflow tool is expected to run flux jobs, etc.
🌀 Launcher Mode: flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d   -Stbon.fanout=256   -Srundir=/mnt/flux/view/run/flux    -Sstatedir=/mnt/flux/view/var/lib/flux   -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s    -Slog-stderr-level=6    -Slog-stderr-mode=local  
broker.info[0]: start: none->join 0.335857ms
broker.info[0]: parent-none: join->init 0.016684ms
cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
job-manager.info[0]: restart: 0 jobs
job-manager.info[0]: restart: 0 running jobs
job-manager.info[0]: restart: checkpoint.job-manager not found
broker.info[0]: rc1.0: running /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc1.d/02-cron
broker.info[0]: rc1.0: /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc1 Exited (rc=0) 1.1s
broker.info[0]: rc1-success: init->quorum 1.12243s
broker.info[0]: online: flux-sample-0 (ranks 0)
broker.info[0]: online: flux-sample-[0-3] (ranks 0-3)

# This indicates the quorum is full
broker.info[0]: quorum-full: quorum->run 0.634584s

# Here is where I think maestro starts?
[2023-11-14 17:31:24: INFO] INFO Logging Level -- Enabled
[2023-11-14 17:31:24: WARNING] WARNING Logging Level -- Enabled
[2023-11-14 17:31:24: CRITICAL] CRITICAL Logging Level -- Enabled
[2023-11-14 17:31:24: INFO] Loading specification -- path = ./lulesh-flux.yaml
[2023-11-14 17:31:24: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/logs
[2023-11-14 17:31:24: INFO] Adding step 'make-lulesh' to study 'lulesh_sample1'...
[2023-11-14 17:31:24: INFO] Adding step 'run-lulesh' to study 'lulesh_sample1'...
[2023-11-14 17:31:24: INFO] run-lulesh is dependent on make-lulesh. Creating edge (make-lulesh, run-lulesh)...
[2023-11-14 17:31:24: INFO] 
------------------------------------------
Submission attempts =       1
Submission restart limit =  1
Submission throttle limit = 0
Use temporary directory =   False
Hash workspaces =           False
Dry run enabled =           False
Output path =               /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124
------------------------------------------
[2023-11-14 17:31:24: INFO] Acquiring -- LULESH
[2023-11-14 17:31:24: INFO] Checking for connectivity to 'https://github.com/LLNL/LULESH.git'
[2023-11-14 17:31:24: INFO] Connectivity achieved!
[2023-11-14 17:31:24: INFO] Cloning 'LULESH' from 'https://github.com/LLNL/LULESH.git'...
[2023-11-14 17:31:25: INFO] Running Maestro Conductor in the foreground.
[2023-11-14 17:31:25: INFO] 
------------------------------------------
Submission attempts =       1
Submission throttle limit = 0
Use temporary directory =   False
Tmp Dir = 
------------------------------------------
[2023-11-14 17:31:25: INFO] 
==================================================
name: lulesh_sample1
description: A sample LULESH study that downloads, builds, and runs a parameter study of varying problem sizes and iterations on FLUX.
==================================================

[2023-11-14 17:31:25: INFO] 
==================================================
Constructing parameter study 'lulesh_sample1'
==================================================

[2023-11-14 17:31:25: INFO] 
==================================================
Processing step '_source'
==================================================

[2023-11-14 17:31:25: INFO] Encountered '_source'. Adding and continuing.
[2023-11-14 17:31:25: INFO] 
==================================================
Processing step 'make-lulesh'
==================================================

[2023-11-14 17:31:25: INFO] 
-------------------------------------------------
Adding step 'make-lulesh' (No parameters used)
-------------------------------------------------

[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = cd /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH
mkdir build
cd build
cmake -WITH_MPI=Off -WITH_OPENMP=Off ..
make

[2023-11-14 17:31:25: INFO] 
==================================================
Processing step 'run-lulesh'
==================================================

[2023-11-14 17:31:25: INFO] 
==================================================
Expanding step 'run-lulesh'
==================================================
-------- Used Parameters --------
{'SIZE', 'ITERATIONS'}
---------------------------------
[2023-11-14 17:31:25: INFO] 
**********************************
Combo [SIZE.100.ITER.10]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 10 -p > SIZE.100.ITER.10.log

[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 10 -p > SIZE.100.ITER.10.log

[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.10.SIZE.100)...
[2023-11-14 17:31:25: INFO] 
**********************************
Combo [SIZE.100.ITER.20]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 20 -p > SIZE.100.ITER.20.log

[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 20 -p > SIZE.100.ITER.20.log

[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.20.SIZE.100)...
[2023-11-14 17:31:25: INFO] 
**********************************
Combo [SIZE.100.ITER.30]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 30 -p > SIZE.100.ITER.30.log

[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 30 -p > SIZE.100.ITER.30.log

[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.30.SIZE.100)...
[2023-11-14 17:31:25: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/meta
[2023-11-14 17:31:25: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/meta/study
[2023-11-14 17:31:25: INFO] Checking DAG status at 2023-11-14 17:31:25.588449

# Here is where things start to get weird - at first we have a flux_uri, but then not?
[2023-11-14 17:31:25: INFO] Found FLUX_URI in environment, scheduling jobs to broker uri local:///mnt/flux/view/run/flux/local
[2023-11-14 17:31:25: INFO] No FLUX_URI; scheduling standalone batch job to root instance
Traceback (most recent call last):
  File "/usr/local/bin/maestro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/maestro.py", line 507, in main
    rc = args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/maestro.py", line 341, in run_study
    completion_status = conductor.monitor_study()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/conductor.py", line 352, in monitor_study
    completion_status = dag.execute_ready_steps()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/datastructures/core/executiongraph.py", line 734, in execute_ready_steps
    adapter = adapter(**self._adapter)
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/interfaces/script/fluxscriptadapter.py", line 116, in __init__
    self._broker_version = self._interface.get_flux_version()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/abstracts/interfaces/flux.py", line 75, in get_flux_version
    cls.connect_to_flux()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/abstracts/interfaces/flux.py", line 22, in connect_to_flux
    cls.flux_handle = flux.Flux()
NameError: name 'flux' is not defined
broker.err[0]: rc2.0: maestro run -fg ./lulesh-flux.yaml -y Exited (rc=1) 1.8s
broker.info[0]: rc2-fail: run->cleanup 1.75864s
broker.info[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.1s
broker.info[0]: cleanup.1: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.1s
broker.info[0]: cleanup.2: flux queue idle --quiet Exited (rc=0) 0.1s
broker.info[0]: cleanup-success: cleanup->shutdown 0.278262s
broker.info[0]: children-complete: shutdown->finalize 95.1042ms
broker.info[0]: rc3.0: /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc3 Exited (rc=0) 0.2s
broker.info[0]: rc3-success: finalize->goodbye 0.17189s
broker.info[0]: goodbye: goodbye->exit 0.046488ms

Can we talk about the steps / flow of logic between that first FLUX_URI being found and the second message? If the second check isn't finding flux because the PYTHONPATH isn't being passed forward, that might be the bug?
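
For what it's worth, a NameError there (rather than an ImportError) usually points at a guarded import that failed silently when the module was loaded. A minimal sketch of that pattern, not the actual maestrowf code:

try:
    import flux
except ImportError:
    pass  # the name 'flux' is left undefined in this module

def connect_to_flux():
    return flux.Flux()  # raises NameError: name 'flux' is not defined

So if the bindings weren't importable at the moment maestro loaded that module, every later call would fail exactly like the traceback above.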

For context, here is the workflow I'm running:

description:
    name: lulesh_sample1
    description: A sample LULESH study that downloads, builds, and runs a parameter study of varying problem sizes and iterations on FLUX.

env:
    variables:
        OUTPUT_PATH: ./studies/lulesh

    labels:
        outfile: $(SIZE.label).$(ITERATIONS.label).log

    dependencies:
      git:
        - name: LULESH
          path: $(OUTPUT_PATH)
          url: https://github.com/LLNL/LULESH.git

batch:
    type        : flux

study:
    - name: make-lulesh
      description: Build the MPI enabled version of LULESH.
      run:
          cmd: |
            cd $(LULESH)
            mkdir build
            cd build
            cmake -WITH_MPI=Off -WITH_OPENMP=Off ..
            make
          depends: []

    - name: run-lulesh
      description: Run LULESH.
      run:
          cmd: |
            $(LAUNCHER) $(LULESH)/build/lulesh2.0 -s $(SIZE) -i $(ITERATIONS) -p > $(outfile)
          depends: [make-lulesh]
          nodes: 1
          procs: 1
          cores per task: 1
          nested: True
          priority: high
          walltime: "00:60:00"

# Note that I reduced these sizes for a single container run
global.parameters:
    SIZE:
        values  : [100, 100, 100]
        # values  : [100, 100, 100, 200, 200, 200, 300, 300, 300]
        label   : SIZE.%%
    ITERATIONS:
        # values  : [10, 20, 30, 10, 20, 30, 10, 20, 30]
        values  : [10, 20, 30]
        label   : ITER.%%
@jwhite242 (Collaborator)

Well, it looks like the FLUX_URI info messages are both a separate bug and a red herring: the second one should be wrapped in an else, as it gets logged regardless of whether a URI was found or not! Thanks for pointing that out! Here is the offending line:

LOGGER.info(f"No FLUX_URI; scheduling standalone batch job to root instance")
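
A minimal sketch of the guarded version, with illustrative structure rather than the actual maestrowf code:

import logging
import os

LOGGER = logging.getLogger(__name__)

flux_uri = os.environ.get("FLUX_URI")
if flux_uri:
    LOGGER.info("Found FLUX_URI in environment, scheduling jobs to broker uri %s", flux_uri)
else:
    LOGGER.info("No FLUX_URI; scheduling standalone batch job to root instance")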

The import error looks to be the real problem here. I'm not sure how the PYTHONPATH would get changed once maestro starts running, so it might just be borked from the start? The URI gets sourced from the env var before maestro tries connecting to any broker, so I don't think that tells us much about the state of things :/.

Can you say more about how this is being run? Is this launching a container, building the maestro env, and then running maestro inside it, or are you trying to have maestro external to the container schedule to a broker living inside it? (If the latter, that'll be a new one for me for how to wire up the bindings...) Could you maybe try having a non-maestro script talk to the python bindings in place of the call to 'maestro run -fg ...' and see if that can import flux, so we can rule out an environment issue?
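
Something like this minimal check would do it (a hypothetical script, assuming the bindings are supposed to be on PYTHONPATH and FLUX_URI points at the broker):

# check_flux.py: sanity-check the flux bindings and broker connection
import os

import flux  # fails right here if the bindings aren't importable

print("FLUX_URI =", os.environ.get("FLUX_URI"))
handle = flux.Flux()  # connects to the broker named by FLUX_URI
print("broker rank =", handle.attr_get("rank"))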

@vsoch (Author) commented Nov 14, 2023

Can you say more about how this is being run? Is this launching a container, building the maestro env, and then running maestro inside it, or are you trying to have maestro external to the container schedule to a broker living inside it? (If the latter, that'll be a new one for me for how to wire up the bindings...) Could you maybe try having a non-maestro script talk to the python bindings in place of the call to 'maestro run -fg ...' and see if that can import flux, so we can rule out an environment issue?

Sure! So I have this container: https://github.com/rse-ops/hpc-apps/blob/main/maestro-lulesh/Dockerfile and we are using the Flux Operator with a new design where we stay in that container (and don't need flux installed in it). Flux is added as an on-demand view, and then we run the maestro command, this:

maestro run -fg ./lulesh-flux.yaml -y

I confirmed that PYTHONPATH had both the path for maestro (which is in dist-packages) and for flux. Same for PATH. So to answer your question, we are just running a container with maestro, with flux added on the fly, but the flux python paths are different from where maestro is installed (and that might be the issue). I could try installing maestro into the same flux python environment if that might help debug?

@jwhite242 (Collaborator)

Hmmm. I don't know that you need full flux itself installed in the same env as maestro, but rather just the bindings. Can you add an extra bit in that container to also pip install flux-python==0.51.0? (Or maybe some custom newer wheel for that.) Maybe that'll be a bit more reliable than the manual PYTHONPATH hacking?

@vsoch (Author) commented Nov 14, 2023

Can you add an extra bit in that container to also pip install flux-python==0.51.0

Yep, will try that and report back.

@vsoch (Author) commented Nov 14, 2023

Okay, that won't work because we don't have flux-security in these containers. Long story, but it was easier to skip the whole munge thing.

Collecting flux-python==0.51.0
  Using cached flux-python-0.51.0.tar.gz (217 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      Cannot find flux security under expected path /mnt/flux/view/include/flux/security
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

@vsoch (Author) commented Nov 14, 2023

I can look into whether we can ease that requirement.

@jwhite242 (Collaborator) commented Nov 14, 2023

Ok, how about this more immediate bandaid:

1. Get one of the required dependencies for the bindings:

pip install cffi

2. Wire up the python path automatically using awk and flux's exported environment vars:

export PYTHONPATH=$PYTHONPATH:`flux env | awk -F "[= ]" '{if ($2 == "PYTHONPATH") {env=$3; split(env, p, "\""); print p[2]}}'`

(The awk pulls the quoted PYTHONPATH value out of the export line that flux env prints.)

That was the old solution before the lovely python bindings were available separately.

EDIT: remove obnoxious heading formatting

@vsoch (Author) commented Nov 14, 2023

haha thank you! I should have a tweaked pip install version pushed shortly, so I want to try that first.

@vsoch (Author) commented Nov 14, 2023

Update: install is working:
[screenshot showing the successful install]

@jwhite242 (Collaborator)

sweet!

@vsoch (Author) commented Nov 14, 2023

Okay, making tiny progress: my container was missing cmake and mpich, so the failure likely resulted from a worker node (pod) not having them. I'm rebuilding and will test again after that (for this local test I just installed them on one pod).

[screenshot]

@vsoch (Author) commented Nov 14, 2023

Okay, with the rebuild things started out OK:

[screenshot]

but still resulted in:

[screenshot]

What should we try next?

@jwhite242 (Collaborator)

Hmm... maybe try turning debug logging on: maestro -d 1 run ...? That will give slightly more info on the broker it's talking to before that error gets caught. But we might need to instrument things a bit more and/or resort to trying to submit the batch script via the python bindings manually: this error looks like flux itself isn't liking something in the job spec it's being given. This is where it's getting thrown, so you can see the flux api commands being used.

Additionally, what about trying a serial spec to see if there are mpi-related issues in here somehow: https://github.com/LLNL/maestrowf/blob/develop/samples/hello_world/hello_bye_parameterized_flux.yaml ?

@vsoch (Author) commented Nov 15, 2023

Here is the output with -d:

[screenshot]

@jwhite242 (Collaborator)

Well, that's not really a readable handle address, is it...

As for why it's choking, can you write up a standalone script using the same from_nest_command and see if we can even submit the job script manually? I'm not entirely familiar with the part of the stack trace inside flux, so I'll need to go dig around in there to see what might trigger that (maybe a malformed job script/spec?), as maestro's exceptions don't really say much about what's going on.
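
A rough sketch of what that standalone test might look like, mirroring the adapter's use of from_nest_command (the script path is a placeholder for one of maestro's generated step scripts, and the resource numbers are illustrative):

# manual_submit.py: submit a step script through the flux bindings
import flux
from flux import job

handle = flux.Flux()

# Build a nested-instance jobspec the way maestro's flux adapter does.
jobspec = job.JobspecV1.from_nest_command(
    ["/bin/bash", "/path/to/generated/step.sh"],  # placeholder path
    num_slots=1,
    num_nodes=1,
    cores_per_slot=1,
)
jobid = job.submit(handle, jobspec)
print("submitted:", jobid)

If that raises the same error, the problem is on the jobspec/flux side rather than anything maestro-specific.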

@vsoch (Author) commented Nov 15, 2023

I think I'm probably going to drop working on this for now, at least until someone else requests that the example work again: it's one workflow of (potentially) many that would be nice to have working as it did before, but it shouldn't block work on other things. Let's leave the issue open in case anyone else has insights. Thanks for the help @jwhite242!
