request for debug support: losing tasks #1354

Open
andre-merzky opened this issue Mar 3, 2025 · 6 comments

andre-merzky commented Mar 3, 2025

Hi Fluxies,

I am using a Spack-installed Flux scheduler on Frontier, running 10k tasks (a shell script which, essentially, runs /bin/true as a test workload) across 10 nodes. Setup works fine and performance is good, but I am losing tasks at a rate of about 0.01%. That is not much, but the randomness worries me, and I am concerned it could randomly stall the production workflow at scale.

What I observe is the following (numbers are from a specific run; stats are comparable across the ~20 test runs I did): for a completing task, I capture the following Flux events:

$ grep ƒhXJ4SdD events.log
 flux event: ƒhXJ4SdD: submit [None]
 flux event: ƒhXJ4SdD: depend [None]
 flux event: ƒhXJ4SdD: alloc [AGENT_EXECUTING_PENDING]
 flux event: ƒhXJ4SdD: start [AGENT_EXECUTING]
 flux event: ƒhXJ4SdD: finish [AGENT_STAGING_OUTPUT_PENDING]
 flux event: ƒhXJ4SdD: release [unschedule]
 flux event: ƒhXJ4SdD: free [None]
 flux event: ƒhXJ4SdD: clean [None]

For a task which I seem to 'lose' (I am not even sure if that is the right term), I see the following events:

$ grep ƒhXJ4SdE events.log
 flux event: ƒhXJ4SdE: submit [None]
 flux event: ƒhXJ4SdE: depend [None]
 flux event: ƒhXJ4SdE: alloc [AGENT_EXECUTING_PENDING]
 flux event: ƒhXJ4SdE: start [AGENT_EXECUTING]

So the task seems to be starting all right - and in fact I see stdout and stderr files appear in the task sandbox (located on /lustre/orion/chm155/scratch). However, the created files all remain empty.

I am not sure if that is really a Flux problem; it could just as well be a file system issue - but I am at a loss on how to debug this further. Thus my question: are there any further Flux logs or debug settings I could use to get more information about the tasks? Does Flux perform any heartbeat checks or resource consumption checks on started tasks? Any other ideas on how to approach this?

Thanks for any feedback - Andre.

@solofoA45

@andre-merzky: could you provide details (test script, method used to detect "lost" tasks)? I may be able to reproduce it on my test cluster.

@andre-merzky (Author)

Ack - I can try to create a standalone reproducer, but it may take a few days - the Flux code is a bit scattered across our stack.

@solofoA45

@andre-merzky don't bother if it's too much work - I thought it would be something like:
time flux submit --cc 1-10000 --wait --progress /bin/true

grondo (Contributor) commented Mar 3, 2025

@andre-merzky how are you submitting and monitoring the events of your Flux jobs? (e.g. flux.job.submit_async and flux.job.event_watch_async, or FluxExecutor?)

If the Flux instance of interest is still around, you can check that the job eventlog matches what you're seeing above, e.g. flux job eventlog -H ƒhXJ4SdE. If it still shows the last event as start, then Flux thinks the job is still running. You can get the nodes on which it thinks the job is still active with flux jobs ƒhXJ4SdE, then log in and check whether that is true. If the job eventlog has finish and clean events, though, then Flux thinks the job has finished and we'll have to debug why the interface you are using to watch events is dropping some of them.
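
Roughly, as a sketch of those checks (the node name and process pattern below are placeholders):

$ flux job eventlog -H ƒhXJ4SdE      # does the eventlog stop at 'start', or does it reach finish/clean?
$ flux jobs ƒhXJ4SdE                 # shows where Flux thinks the job is (still) running
$ ssh <node>                         # <node>: one of the nodes listed above
$ ps -ef | grep <task-script>        # <task-script>: your shell script; check whether its processes actually exist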

andre-merzky (Author) commented Mar 3, 2025

> @andre-merzky don't bother if it's too much work - I thought it would be something like: time flux submit --cc 1-10000 --wait --progress /bin/true

Alas, it is a bit more than that :-) I have started doing that anyway, as a standalone test, but I am not sure it will be ready soon.

> @andre-merzky how are you submitting and monitoring the events of your Flux jobs? (e.g. flux.job.submit_async and flux.job.event_watch_async, or FluxExecutor?)

We start Flux from some Python code with:

srun -n 10 -N 10 --ntasks-per-node 1 --cpus-per-task=56 --gpus-per-task=8 --export=ALL flux start bash -c 'echo "HOST:$(hostname) URI:$FLUX_URI" && sleep inf'

then capture the echoed Flux URI, and then some other process gets an executor instance connected to that URI:

    from importlib import import_module

    flux_job = import_module('flux.job')    # deferred import of the flux.job bindings
    args     = {'url': self._uri}            # self._uri holds the captured FLUX_URI
    exe      = flux_job.executor.FluxExecutor(handle_kwargs=args)
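
For context, a minimal self-contained version of that connection path looks roughly like this (a sketch only: captured_flux_uri stands for the URI echoed by the srun command above, and the jobspec is just a trivial test job):

from flux.job import JobspecV1
from flux.job.executor import FluxExecutor

captured_flux_uri = 'ssh://...'   # placeholder: the URI echoed by flux start above

with FluxExecutor(handle_kwargs={'url': captured_flux_uri}) as exe:
    spec = JobspecV1.from_command(['/bin/true'])   # trivial test jobspec
    fut  = exe.submit(spec)
    print('jobid :', fut.jobid())    # blocks until the job id is assigned
    print('result:', fut.result())   # blocks until the job completes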

The job submission code is, simplified:

import time
from typing import Any, Callable, Dict, List


def submit_jobs(uri  : str,
                specs: List[Dict[str, Any]],
                cb   : Callable[[str, Any], None]) -> Any:
    # `exe` is the FluxExecutor instance created above (enclosing scope)

    def app_cb(fut, _, event):
        # forward Flux events to the application callback
        cb(fut.jobid(), event)

    futures = list()

    def id_cb(fut):
        # called (possibly from an executor thread) once the job id is known
        flux_id = fut.jobid()
        for ev in ['alloc', 'start', 'finish', 'release', 'exception']:
            # 'submit', 'free', 'clean',
            fut.add_event_callback(ev, app_cb)
        futures.append([flux_id, fut])

    for idx, spec in enumerate(specs):
        fut = exe.submit(spec)
        fut.ru_idx = idx  # keep track of submission order
        fut.add_jobid_callback(id_cb)

    # wait until we got job IDs for all submissions
    timeout = magic_number   # application-level timeout (placeholder)
    start   = time.time()
    while len(futures) < len(specs):
        time.sleep(0.1)
        if time.time() - start > timeout:
            raise RuntimeError('timeout on job submission')

    # return the Flux IDs ordered by submission index
    flux_ids = [pair[0] for pair in sorted(futures, key=lambda x: x[1].ru_idx)]

    return flux_ids

(BTW: that code includes some ugliness to make sure the captured Flux IDs are correctly mapped to the submitted specs; I am not sure if there is a better way to do that. Or can I rely on the jobid callbacks being called in the same order the specs were submitted?)

> If the Flux instance of interest is still around, you can check that the job eventlog matches what you're seeing above, e.g. flux job eventlog -H ƒhXJ4SdE. If it still shows the last event as start, then Flux thinks the job is still running. You can get the nodes on which it thinks the job is still active with flux jobs ƒhXJ4SdE, then log in and check whether that is true. If the job eventlog has finish and clean events, though, then Flux thinks the job has finished and we'll have to debug why the interface you are using to watch events is dropping some of them.

Ah, great, I'll use that! Is it also possible to obtain the process ID of the started job?

Thanks both!

grondo (Contributor) commented Mar 3, 2025

> BTW: that code includes some ugliness to make sure the captured Flux IDs are correctly mapped to the submitted specs; I am not sure if there is a better way to do that. Or can I rely on the jobid callbacks being called in the same order the specs were submitted?

I don't think you can rely on ordering with the FluxExecutor since submissions could use multiple threads with multiple Flux handles, and according to the docs the callback may be called in another thread. @jameshcorbett is probably most familiar with the FluxExecutor interface and may have more to add here.
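
One way to avoid depending on ordering altogether is to bind the submission index into the jobid callback at submit time, along these lines (a sketch only; exe and specs as in your snippet above):

import functools
import threading

flux_ids      = {}                # submission index -> Flux job id
flux_ids_lock = threading.Lock()  # jobid callbacks may fire from executor threads

def id_cb(idx, fut):
    with flux_ids_lock:
        flux_ids[idx] = fut.jobid()

for idx, spec in enumerate(specs):
    fut = exe.submit(spec)
    # functools.partial pins this submission's index to its own callback,
    # so the mapping is correct regardless of the order callbacks run in
    fut.add_jobid_callback(functools.partial(id_cb, idx))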

> Ah, great, I'll use that! Is it also possible to obtain the process ID of the started job?

There is actually a command flux job hostpids which will print a comma-separated list of host:PID pairs for a job.
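
For example, with the job id from this thread:

$ flux job hostpids ƒhXJ4SdE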
