Failed jobs due to missing DW environment within flux allocation #194

Open
mcfadden8 opened this issue Aug 8, 2024 · 4 comments

Comments

@mcfadden8

mcfadden8 commented Aug 8, 2024

General Problem: When sequentially submitting 532 single-node jobs to 532 nodes on the El Cap iotesting queue, I ran into two problems. The good news is that 523 jobs ran successfully. However, 6 jobs reported an error and 3 jobs were killed by a signal. This thread pertains to the 6 jobs that reported an error. (This is reproducible.)

@bdevcich
Contributor

Marty, can you provide more detail here or can this be closed?

@mcfadden8
Author

Focusing on 1 of the 3 jobs that were killed, I see the following from Flux:

flux job info f2BVr7EH4GB9 eventlog
{"timestamp":1723142411.257493,"name":"submit","context":{"userid":54987,"urgency":16,"flags":0,"version":1}}
{"timestamp":1723142411.4714682,"name":"validate"}
{"timestamp":1723142411.7227607,"name":"dependency-add","context":{"description":"dws-create"}}
{"timestamp":1723142469.0961509,"name":"memo","context":{"rabbit_workflow":"fluxjob-508774988934663168"}}
{"timestamp":1723142484.6505346,"name":"dependency-remove","context":{"description":"dws-create"}}
{"timestamp":1723142484.6505933,"name":"depend"}
{"timestamp":1723142484.6506846,"name":"priority","context":{"priority":16}}
{"timestamp":1723142484.890485,"name":"alloc","context":{"annotations":{"user":{"rabbit_workflow":"fluxjob-508774988934663168"}}}}
{"timestamp":1723142484.8907204,"name":"prolog-start","context":{"description":"job-manager.prolog"}}
{"timestamp":1723142484.8907382,"name":"prolog-start","context":{"description":"cray-pals-port-distributor"}}
{"timestamp":1723142484.8907464,"name":"prolog-start","context":{"description":"dws-setup"}}
{"timestamp":1723142484.9701443,"name":"prolog-finish","context":{"description":"cray-pals-port-distributor","status":0}}
{"timestamp":1723142531.349112,"name":"memo","context":{"rabbits":"elcap438"}}
{"timestamp":1723142657.4949949,"name":"exception","context":{"type":"exception","severity":0,"note":"DWS/Rabbit interactions failed: workflow in 'TransientCondition' state too long: None","userid":765}}
{"timestamp":1723142657.4951036,"name":"prolog-finish","context":{"description":"dws-setup","status":1}}
{"timestamp":1723142657.4951787,"name":"epilog-start","context":{"description":"dws-epilog"}}
{"timestamp":1723142658.4516304,"name":"exception","context":{"type":"prolog","severity":0,"note":"prolog killed by signal 15 (timeout or job canceled)","userid":765}}
{"timestamp":1723142658.4516723,"name":"prolog-finish","context":{"description":"job-manager.prolog","status":36608}}
{"timestamp":1723142710.1166995,"name":"epilog-finish","context":{"description":"dws-epilog","status":0}}
{"timestamp":1723142710.1171415,"name":"free"}
{"timestamp":1723142710.1171782,"name":"clean"}

Flux created no output file for stdout or stderr.
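
For anyone triaging similar failures, the eventlog above can be summarized with a short script. The following is a minimal Python sketch, not part of Flux: the helper names (`parse_eventlog`, `prolog_durations`, `exceptions`) are hypothetical, and it assumes only that each eventlog line is a JSON object, as shown above. `SAMPLE` holds six lines copied from that eventlog.

```python
import json

def parse_eventlog(text):
    """Parse eventlog text (one JSON object per line) into a list of dicts."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{"):  # skip the command line and blank lines
            events.append(json.loads(line))
    return events

def prolog_durations(events):
    """Return {description: (seconds, status)} for each prolog-start/prolog-finish pair."""
    started = {}
    durations = {}
    for ev in events:
        desc = ev.get("context", {}).get("description")
        if ev["name"] == "prolog-start":
            started[desc] = ev["timestamp"]
        elif ev["name"] == "prolog-finish" and desc in started:
            elapsed = ev["timestamp"] - started[desc]
            durations[desc] = (elapsed, ev["context"].get("status"))
    return durations

def exceptions(events):
    """Return (timestamp, type, note) for every exception event."""
    return [
        (ev["timestamp"], ev["context"]["type"], ev["context"]["note"])
        for ev in events
        if ev["name"] == "exception"
    ]

# Six lines copied from the eventlog above (the rest are omitted for brevity).
SAMPLE = """
{"timestamp":1723142484.8907204,"name":"prolog-start","context":{"description":"job-manager.prolog"}}
{"timestamp":1723142484.8907464,"name":"prolog-start","context":{"description":"dws-setup"}}
{"timestamp":1723142657.4949949,"name":"exception","context":{"type":"exception","severity":0,"note":"DWS/Rabbit interactions failed: workflow in 'TransientCondition' state too long: None","userid":765}}
{"timestamp":1723142657.4951036,"name":"prolog-finish","context":{"description":"dws-setup","status":1}}
{"timestamp":1723142658.4516304,"name":"exception","context":{"type":"prolog","severity":0,"note":"prolog killed by signal 15 (timeout or job canceled)","userid":765}}
{"timestamp":1723142658.4516723,"name":"prolog-finish","context":{"description":"job-manager.prolog","status":36608}}
"""

events = parse_eventlog(SAMPLE)
for desc, (seconds, status) in prolog_durations(events).items():
    print(f"{desc}: {seconds:.1f} s, status {status}")
for ts, kind, note in exceptions(events):
    print(f"{ts:.3f} [{kind}] {note}")
```

On these lines, the dws-setup prolog runs for roughly 173 seconds before finishing with status 1, immediately after the 'TransientCondition' exception is raised.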

@mcfadden8
Author

Looking at the logs associated with this job, I see:

grep 508774988934663168 * | grep -i ERROR
compute.journalctl.elcap4795:Aug 08 11:43:32 elcap4795 clientmountd[47577]: 2024-08-08T11:43:32-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:43:32 elcap4795 clientmountd[47577]: 2024-08-08T11:43:32-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:43:32 elcap4795 clientmountd[47577]: 2024-08-08T11:43:32-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "d567d7e2-6da0-4f80-bbc9-dc3c7413f422", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:43:52 elcap4795 clientmountd[47577]: 2024-08-08T11:43:52-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:43:52 elcap4795 clientmountd[47577]: 2024-08-08T11:43:52-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:43:52 elcap4795 clientmountd[47577]: 2024-08-08T11:43:52-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "6439d0d0-a995-4f5a-9d0c-d64213aa7dd7", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:02 elcap4795 clientmountd[47577]: 2024-08-08T11:44:02-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:02 elcap4795 clientmountd[47577]: 2024-08-08T11:44:02-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:02 elcap4795 clientmountd[47577]: 2024-08-08T11:44:02-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "58c7cc0c-076b-4a3a-9391-dd14c857ef9c", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:12 elcap4795 clientmountd[47577]: 2024-08-08T11:44:12-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:12 elcap4795 clientmountd[47577]: 2024-08-08T11:44:12-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:12 elcap4795 clientmountd[47577]: 2024-08-08T11:44:12-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "e71e1049-de3e-4eb4-b36d-37f63851a7ba", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:23 elcap4795 clientmountd[47577]: 2024-08-08T11:44:23-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:23 elcap4795 clientmountd[47577]: 2024-08-08T11:44:23-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:23 elcap4795 clientmountd[47577]: 2024-08-08T11:44:23-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "4f72d7a3-7926-4117-b99f-7527431b6d89", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:33 elcap4795 clientmountd[47577]: 2024-08-08T11:44:33-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:33 elcap4795 clientmountd[47577]: 2024-08-08T11:44:33-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:33 elcap4795 clientmountd[47577]: 2024-08-08T11:44:33-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "dbebe05a-1509-45f3-9822-51b1e721f4c8", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:43 elcap4795 clientmountd[47577]: 2024-08-08T11:44:43-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:43 elcap4795 clientmountd[47577]: 2024-08-08T11:44:43-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:43 elcap4795 clientmountd[47577]: 2024-08-08T11:44:43-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "d13fe73a-7bf0-40cc-9d55-158026440999", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
grep: rabbit.pods.elcap438: Is a directory

Note: I had to snip out some messages from the middle of the grep output above so that the paste would fit.
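
Because the same error repeats on every reconcile attempt, it may help to collapse the grep output into one row per distinct error. The following is a minimal Python sketch under stated assumptions: the function name `summarize_errors` and the regular expression are hypothetical, and the line layout (file prefix, syslog header, ISO timestamp, level, message, trailing JSON) is inferred only from the lines pasted above. `SAMPLE_LINES` holds the first and last ERROR lines and the final grep message copied from that output.

```python
import json
import re
from collections import defaultdict
from datetime import datetime

# Layout assumed from the pasted output:
# <file>:<syslog header> <ISO timestamp> <LEVEL> <message> <JSON fields>
LINE_RE = re.compile(
    r"^(?P<source>[^:]+):.*?"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<rest>.*)$"
)

def summarize_errors(lines):
    """Group ERROR lines by (source file, error text); record count and first/last time."""
    summary = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("level") != "ERROR":
            continue
        rest = m.group("rest")
        brace = rest.find("{")
        if brace < 0:
            continue
        try:
            fields = json.loads(rest[brace:])
        except json.JSONDecodeError:
            continue  # a truncated or snipped line; skip it
        ts = datetime.fromisoformat(m.group("ts"))
        entry = summary[(m.group("source"), fields.get("error", ""))]
        entry["count"] += 1
        entry["first"] = ts if entry["first"] is None else min(entry["first"], ts)
        entry["last"] = ts if entry["last"] is None else max(entry["last"], ts)
    return dict(summary)

# First and last ERROR lines, plus the final grep message, copied from the output above.
SAMPLE_LINES = [
    'compute.journalctl.elcap4795:Aug 08 11:43:32 elcap4795 clientmountd[47577]: '
    '2024-08-08T11:43:32-07:00        ERROR        Reconciler error        '
    '{"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", '
    '"controllerKind": "ClientMount", '
    '"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, '
    '"namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", '
    '"reconcileID": "d567d7e2-6da0-4f80-bbc9-dc3c7413f422", '
    '"error": "internal error: mount/unmount failed: unable to activate block device: '
    'timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: '
    'no such file or directory"}',
    'compute.journalctl.elcap4795:Aug 08 11:44:43 elcap4795 clientmountd[47577]: '
    '2024-08-08T11:44:43-07:00        ERROR        Reconciler error        '
    '{"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", '
    '"controllerKind": "ClientMount", '
    '"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, '
    '"namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", '
    '"reconcileID": "d13fe73a-7bf0-40cc-9d55-158026440999", '
    '"error": "internal error: mount/unmount failed: unable to activate block device: '
    'timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: '
    'no such file or directory"}',
    'grep: rabbit.pods.elcap438: Is a directory',
]

for (source, error), info in summarize_errors(SAMPLE_LINES).items():
    span = (info["last"] - info["first"]).total_seconds()
    print(f"{source}: {info['count']} errors over {span:.0f} s")
    print(f"  {error}")
```

Only ERROR lines are counted here; the paired INFO lines carry the same message, so including them would double-count each reconcile attempt.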
