request for debug support: losing tasks #1354

Open
andre-merzky opened this issue Mar 3, 2025 · 6 comments

andre-merzky commented Mar 3, 2025

Hi Fluxies,

I am using a Spack-installed Flux scheduler on Frontier, running 10k tasks (a shell script which, essentially, runs /bin/true as a test workload) across 10 nodes. Setup works fine and performance is good, but I am losing tasks at a rate of about 0.01%. That is not much, but the randomness worries me, and I am concerned it could randomly stall the production workflow at scale.

What I observe is the following (numbers are from a specific run; stats are comparable across the ~20 test runs I did): for a completing task, I capture the following Flux events:

$ grep ƒhXJ4SdD events.log
 flux event: ƒhXJ4SdD: submit [None]
 flux event: ƒhXJ4SdD: depend [None]
 flux event: ƒhXJ4SdD: alloc [AGENT_EXECUTING_PENDING]
 flux event: ƒhXJ4SdD: start [AGENT_EXECUTING]
 flux event: ƒhXJ4SdD: finish [AGENT_STAGING_OUTPUT_PENDING]
 flux event: ƒhXJ4SdD: release [unschedule]
 flux event: ƒhXJ4SdD: free [None]
 flux event: ƒhXJ4SdD: clean [None]

For a task which I seem to 'lose' (I am not even sure if that is the right term), I see the following events:

$ grep ƒhXJ4SdE events.log
 flux event: ƒhXJ4SdE: submit [None]
 flux event: ƒhXJ4SdE: depend [None]
 flux event: ƒhXJ4SdE: alloc [AGENT_EXECUTING_PENDING]
 flux event: ƒhXJ4SdE: start [AGENT_EXECUTING]

So the task seems to be starting all right - and in fact I see stdout and stderr files appear in the task sandbox (located on /lustre/orion/chm155/scratch). However, the created files all remain empty.

I am not sure if that is really a Flux problem; it could just as well be a file system issue - but I am at a loss on how to debug this further. Thus my question: are there any further Flux logs or debug settings I could use to get more information about the tasks? Does Flux perform any heartbeat checks or resource consumption checks on started tasks? Any other ideas on how to approach this?

Thanks for any feedback - Andre.

@solofoA45

@andre-merzky: could you provide details (test script, method used to detect "lost" tasks)? I may be able to reproduce it on my test cluster.

@andre-merzky (Author)

Ack - I can try to create a standalone reproducer, but it may take a few days - the Flux code is a bit scattered across our stack.

@solofoA45

@andre-merzky don't bother if it's too much work - I thought it would be something like:
time flux submit --cc 1-10000 --wait --progress /bin/true

grondo (Contributor) commented Mar 3, 2025

@andre-merzky how are you submitting and monitoring the events of your Flux jobs? (e.g. flux.job.submit_async and flux.job.event_watch_async, or FluxExecutor?)

If the Flux instance of interest is still around, you can check that the job eventlog matches what you're seeing above, e.g. flux job eventlog -H ƒhXJ4SdE. If it still shows the last event as start, then Flux thinks the job is still running. You can get the nodes on which it thinks the job is still active with flux jobs ƒhXJ4SdE, then log in and check whether that is true. If the job eventlog has finish and clean events, though, then Flux thinks the job has finished and we'll have to debug why the interface you are using to watch events is dropping some of them.
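
Roughly, as a sketch of those checks (the node name and process pattern below are placeholders):

$ flux job eventlog -H ƒhXJ4SdE      # does the eventlog stop at 'start', or does it reach finish/clean?
$ flux jobs ƒhXJ4SdE                 # shows where Flux thinks the job is (still) running
$ ssh <node>                         # <node>: one of the nodes listed above
$ ps -ef | grep <task-script>        # <task-script>: your shell script; check whether its processes actually exist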

andre-merzky (Author) commented Mar 3, 2025

> @andre-merzky don't bother if it's too much work - I thought it would be something like: time flux submit --cc 1-10000 --wait --progress /bin/true

Alas, it is a bit more than that :-) I have started doing that anyway, as a standalone test, but I am not sure it will be ready soon.

> @andre-merzky how are you submitting and monitoring the events of your Flux jobs? (e.g. flux.job.submit_async and flux.job.event_watch_async, or FluxExecutor?)

We start Flux from some Python code with:

srun -n 10 -N 10 --ntasks-per-node 1 --cpus-per-task=56 --gpus-per-task=8 --export=ALL flux start bash -c 'echo "HOST:$(hostname) URI:$FLUX_URI" && sleep inf'

then capture the echoed Flux URI, and then some other process gets an executor instance connected to that URI:

    from importlib import import_module

    flux_job = import_module('flux.job')    # deferred import of the flux.job bindings
    args     = {'url': self._uri}            # self._uri holds the captured FLUX_URI
    exe      = flux_job.executor.FluxExecutor(handle_kwargs=args)
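
For context, a minimal self-contained version of that connection path looks roughly like this (a sketch only: captured_flux_uri stands for the URI echoed by the srun command above, and the jobspec is just a trivial test job):

from flux.job import JobspecV1
from flux.job.executor import FluxExecutor

captured_flux_uri = 'ssh://...'   # placeholder: the URI echoed by flux start above

with FluxExecutor(handle_kwargs={'url': captured_flux_uri}) as exe:
    spec = JobspecV1.from_command(['/bin/true'])   # trivial test jobspec
    fut  = exe.submit(spec)
    print('jobid :', fut.jobid())    # blocks until the job id is assigned
    print('result:', fut.result())   # blocks until the job completes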

The job submission code is, simplified:

import time
from typing import Any, Callable, Dict, List


def submit_jobs(uri  : str,
                specs: List[Dict[str, Any]],
                cb   : Callable[[str, Any], None]) -> Any:
    # `exe` is the FluxExecutor instance created above (enclosing scope)

    def app_cb(fut, _, event):
        # forward Flux events to the application callback
        cb(fut.jobid(), event)

    futures = list()

    def id_cb(fut):
        # called (possibly from an executor thread) once the job id is known
        flux_id = fut.jobid()
        for ev in ['alloc', 'start', 'finish', 'release', 'exception']:
            # 'submit', 'free', 'clean',
            fut.add_event_callback(ev, app_cb)
        futures.append([flux_id, fut])

    for idx, spec in enumerate(specs):
        fut = exe.submit(spec)
        fut.ru_idx = idx  # keep track of submission order
        fut.add_jobid_callback(id_cb)

    # wait until we got job IDs for all submissions
    timeout = magic_number   # application-level timeout (placeholder)
    start   = time.time()
    while len(futures) < len(specs):
        time.sleep(0.1)
        if time.time() - start > timeout:
            raise RuntimeError('timeout on job submission')

    # return the Flux IDs ordered by submission index
    flux_ids = [pair[0] for pair in sorted(futures, key=lambda x: x[1].ru_idx)]

    return flux_ids

(BTW: that code includes some ugliness to make sure the captured Flux IDs are correctly mapped to the submitted specs; I am not sure if there is a better way to do that. Or can I rely on the jobid callbacks being called in the same order the specs were submitted?)

> If the Flux instance of interest is still around, you can check that the job eventlog matches what you're seeing above, e.g. flux job eventlog -H ƒhXJ4SdE. If it still shows the last event as start, then Flux thinks the job is still running. You can get the nodes on which it thinks the job is still active with flux jobs ƒhXJ4SdE, then log in and check whether that is true. If the job eventlog has finish and clean events, though, then Flux thinks the job has finished and we'll have to debug why the interface you are using to watch events is dropping some of them.

Ah, great, I'll use that! Is it also possible to obtain the process ID of the started job?

Thanks both!

grondo (Contributor) commented Mar 3, 2025

> BTW: that code includes some ugliness to make sure the captured Flux IDs are correctly mapped to the submitted specs; I am not sure if there is a better way to do that. Or can I rely on the jobid callbacks being called in the same order the specs were submitted?

I don't think you can rely on ordering with the FluxExecutor since submissions could use multiple threads with multiple Flux handles, and according to the docs the callback may be called in another thread. @jameshcorbett is probably most familiar with the FluxExecutor interface and may have more to add here.
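
One way to avoid depending on ordering altogether is to bind the submission index into the jobid callback at submit time, along these lines (a sketch only; exe and specs as in your snippet above):

import functools
import threading

flux_ids      = {}                # submission index -> Flux job id
flux_ids_lock = threading.Lock()  # jobid callbacks may fire from executor threads

def id_cb(idx, fut):
    with flux_ids_lock:
        flux_ids[idx] = fut.jobid()

for idx, spec in enumerate(specs):
    fut = exe.submit(spec)
    # functools.partial pins this submission's index to its own callback,
    # so the mapping is correct regardless of the order callbacks run in
    fut.add_jobid_callback(functools.partial(id_cb, idx))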

> Ah, great, I'll use that! Is it also possible to obtain the process ID of the started job?

There is actually a command flux job hostpids which will print a comma-separated list of host:PID pairs for a job.
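
For example, with the job id from this thread:

$ flux job hostpids ƒhXJ4SdE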
