request for debug support: losing tasks #1354
Comments
@andre-merzky: could you provide details (test script, method to detect "lost" tasks)? I may be able to reproduce it on my test cluster.
Ack - I can try to create a standalone reproducer, but may need some days - the flux code is a bit scattered across our stack.
@andre-merzky don't bother if it's too much work, I thought it would be something like:
@andre-merzky how are you submitting and monitoring the events of your Flux jobs? (e.g. if the Flux instance of interest is still around, you can check that the job eventlog matches what you're seeing above.)
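(For reference, one standard way to dump a job's main eventlog from the shell, assuming the usual flux-core CLI, is:)

```sh
flux job eventlog <jobid>
```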
Alas it is a bit more than that :-) I started to do that anyway, as a standalone test, but not sure if that will be ready soon.
We start flux from some python code with:

```sh
srun -n 10 -N 10 --ntasks-per-node 1 --cpus-per-task=56 --gpus-per-task=8 --export=ALL \
     flux start bash -c 'echo "HOST:$(hostname) URI:$FLUX_URI" && sleep inf'
```

We then capture the echoed Flux URI, and some other process gets an executor instance connected to that URI:

```python
from importlib import import_module

flux_job = import_module('flux.job')

args = {'url': self._uri}
exe  = flux_job.executor.FluxExecutor(handle_kwargs=args)
```

The job submission code is, simplified:

```python
import time
from typing import Any, Callable, Dict, List

def submit_jobs(uri  : str,
                specs: List[Dict[str, Any]],
                cb   : Callable[[str, Any], None]) -> Any:

    # called for every job event we subscribed to below
    def app_cb(fut, _, event):
        cb(fut.jobid(), event)

    futures = list()

    # called once the flux job ID is known
    def id_cb(fut):
        flux_id = fut.jobid()
        for ev in ['alloc', 'start', 'finish', 'release', 'exception']:
            # 'submit', 'free', 'clean',
            fut.add_event_callback(ev, app_cb)
        futures.append([flux_id, fut])

    # 'exe' is the FluxExecutor instance created above
    for idx, spec in enumerate(specs):
        fut = exe.submit(spec)
        fut.ru_idx = idx  # keep track of submission order
        fut.add_jobid_callback(id_cb)

    # wait until we got job IDs for all submissions
    timeout = magic_number
    start   = time.time()
    while len(futures) < len(specs):
        time.sleep(0.1)
        if time.time() - start > timeout:
            raise RuntimeError('timeout on job submission')

    # get flux_ids mapped to submission index
    flux_ids = [fut[0] for fut in sorted(futures, key=lambda x: x[1].ru_idx)]
    return flux_ids
```

(BTW: that code includes some ugliness to make sure the captured flux IDs are correctly mapped to the submitted specs - I am not sure if there is a better way to do that. Or can I rely on the jobid callbacks to be called in the same order the specs are submitted?)
Ah, great, I'll use that! Is it also possible to obtain the process ID of the started job? Thanks both!
I don't think you can rely on ordering with the FluxExecutor since submissions could use multiple threads with multiple Flux handles, and according to the docs the callback may be called in another thread. @jameshcorbett is probably most familiar with the FluxExecutor interface and may have more to add here.
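For what it's worth, here is a minimal sketch of how one could avoid depending on callback ordering altogether, assuming `FluxExecutorFuture.jobid()` blocks until the job ID is available and accepts a timeout (which I believe is the documented behavior), with `exe` and `specs` as in the snippet above:

```python
import concurrent.futures

def submit_in_order(exe, specs, timeout=60):
    """Return Flux job IDs in the same order as 'specs'."""
    futs = [exe.submit(spec) for spec in specs]
    try:
        # jobid() blocks until the ID is known, so the result list is
        # ordered by submission index by construction - no assumptions
        # about callback ordering are needed.
        return [fut.jobid(timeout=timeout) for fut in futs]
    except concurrent.futures.TimeoutError:
        raise RuntimeError('timeout on job submission')
```

Event callbacks could still be attached to each future afterwards, as in the original snippet.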
There is actually a command
Hi Fluxies,
I am using a Spack-installed Flux scheduler on Frontier, running 10k tasks (a shell script which, essentially, runs `/bin/true` as test workload) across 10 nodes. Setup works fine and performance is good, but I am losing tasks at a rate of about 0.01%. That is not much, but the randomness worries me, and I am concerned it could randomly stall the production workflow at scale.

What I observe is the following (numbers from a specific run; stats are comparable across the ~20 test runs I did): for a completing task, I capture the following flux events:
For a task which I seem to 'lose' (I am not even sure if that is the right term), I see the following events:
So the task seems to be starting all right - and in fact I see stdout and stderr files appear in the task sandbox (located on `/lustre/orion/chm155/scratch`). However, all created files remain empty.

I am not sure if that is really a Flux problem - it could just as well be a file system issue - but I am at a loss on how to debug this further. Thus my question: are there any further Flux logs or debug settings I could use to get more information about the tasks? Does Flux perform any heartbeat or resource consumption checks on started tasks? Any other ideas on how to approach this?
Thanks for any feedback - Andre.
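(A few starting points that may help with the question above, assuming the standard flux-core CLI; availability of some subcommands may depend on the installed version:)

```sh
# broker ring-buffer logs of the Flux instance (may show exec/shell errors)
flux dmesg

# per-job execution eventlog, in addition to the main eventlog
flux job eventlog -p guest.exec.eventlog <jobid>

# replay a job's captured stdout/stderr
flux job attach <jobid>
```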