sadness upon loading sched 0.42.2 on tioga with flux-core 0.70.0 #1337

Open
garlick opened this issue Feb 11, 2025 · 2 comments

garlick commented Feb 11, 2025

Problem: when loading sched v0.42.2 and coral2 v0.20.0 on tioga (without a restart), some running jobs got fatal exceptions.

Here's the beginning of the logs (the latter part repeats for several jobs):

2025-02-11T11:26:52.052125-08:00 broker.debug[0]: insmod sched-fluxion-resource
2025-02-11T11:26:52.052487-08:00 sched-fluxion-resource.info[0]: version 0.42.2
2025-02-11T11:26:52.052623-08:00 sched-fluxion-resource.debug[0]: mod_main: resource module starting
2025-02-11T11:26:52.135322-08:00 broker.debug[0]: insmod sched-fluxion-qmanager
2025-02-11T11:26:52.135699-08:00 sched-fluxion-qmanager.info[0]: version 0.42.2
2025-02-11T11:26:52.135833-08:00 sched-fluxion-qmanager.debug[0]: service_register
2025-02-11T11:26:52.135935-08:00 sched-fluxion-qmanager.debug[0]: enforced policy (queue=mi300a): easy
2025-02-11T11:26:52.135951-08:00 sched-fluxion-qmanager.debug[0]: effective queue params (queue=mi300a): queue-depth=1024
2025-02-11T11:26:52.135957-08:00 sched-fluxion-qmanager.debug[0]: effective policy params (queue=mi300a): default
2025-02-11T11:26:52.135962-08:00 sched-fluxion-qmanager.debug[0]: enforced policy (queue=pci): easy
2025-02-11T11:26:52.135967-08:00 sched-fluxion-qmanager.debug[0]: effective queue params (queue=pci): queue-depth=1024
2025-02-11T11:26:52.135971-08:00 sched-fluxion-qmanager.debug[0]: effective policy params (queue=pci): default
2025-02-11T11:26:52.135975-08:00 sched-fluxion-qmanager.debug[0]: enforced policy (queue=pdebug): easy
2025-02-11T11:26:52.135980-08:00 sched-fluxion-qmanager.debug[0]: effective queue params (queue=pdebug): queue-depth=1024
2025-02-11T11:26:52.135983-08:00 sched-fluxion-qmanager.debug[0]: effective policy params (queue=pdebug): default
2025-02-11T11:26:52.215956-08:00 sched-fluxion-resource.debug[0]: resource graph datastore loaded with JGF reader
2025-02-11T11:26:52.226920-08:00 sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from core's resource.acquire
2025-02-11T11:26:52.228517-08:00 sched-fluxion-resource.debug[0]: resource status changed (rankset=[all] status=DOWN)
2025-02-11T11:26:52.228595-08:00 sched-fluxion-resource.debug[0]: resource status changed (rankset=[3-12,17-26,29-32] status=UP)
2025-02-11T11:26:52.228606-08:00 sched-fluxion-resource.debug[0]: mod_main: resource graph database loaded
2025-02-11T11:26:52.228730-08:00 sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed
2025-02-11T11:26:52.228818-08:00 job-manager.debug[0]: scheduler: hello +partial-ok
2025-02-11T11:26:52.233857-08:00 sched-fluxion-resource.err[0]: run: dfu_traverser_t::run (id=1288213394146461696):
2025-02-11T11:26:52.233878-08:00 sched-fluxion-resource.err[0]: run_update: run: No such file or directory
2025-02-11T11:26:52.233888-08:00 sched-fluxion-resource.err[0]: update_request_cb: update failed (id=1288213394146461696): No such file or directory
2025-02-11T11:26:52.234029-08:00 sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: reconstruct (id=1288213394146461696 queue=pdebug): No such file or directory
2025-02-11T11:26:52.234057-08:00 sched-fluxion-qmanager.info[0]: raising fatal exception on running job id=f3zSDASUCS6K
2025-02-11T11:26:52.234506-08:00 job-exec.debug[0]: exec aborted: id=f3zSDASUCS6K
2025-02-11T11:26:52.237352-08:00 sched-fluxion-resource.err[0]: run: dfu_traverser_t::run (id=1288009580382521344):
2025-02-11T11:26:52.237363-08:00 sched-fluxion-resource.err[0]: run_update: run: No such file or directory
2025-02-11T11:26:52.237367-08:00 sched-fluxion-resource.err[0]: update_request_cb: update failed (id=1288009580382521344): No such file or directory
2025-02-11T11:26:52.237512-08:00 sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: reconstruct (id=1288009580382521344 queue=pdebug): No such file or directory
2025-02-11T11:26:52.237537-08:00 sched-fluxion-qmanager.info[0]: raising fatal exception on running job id=f3zQcrbdohXd
2025-02-11T11:26:52.237897-08:00 job-exec.debug[0]: exec aborted: id=f3zQcrbdohXd
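
Since the failure pattern repeats per job, it can help to pull the affected job IDs out of the log in one pass. The following is a minimal sketch, not part of the original report: it assumes the log has been saved as text and that every affected job produces a "raising fatal exception on running job id=..." line like the ones above. The function name `fatal_job_ids` and the `LOG` sample are hypothetical and exist only for illustration.

```python
import re

# Three lines copied from the log excerpt above (two fatal exceptions, one abort).
LOG = """\
2025-02-11T11:26:52.234057-08:00 sched-fluxion-qmanager.info[0]: raising fatal exception on running job id=f3zSDASUCS6K
2025-02-11T11:26:52.234506-08:00 job-exec.debug[0]: exec aborted: id=f3zSDASUCS6K
2025-02-11T11:26:52.237537-08:00 sched-fluxion-qmanager.info[0]: raising fatal exception on running job id=f3zQcrbdohXd
"""

# Matches the qmanager message text seen in the excerpt and captures the job ID.
FATAL_RE = re.compile(r"raising fatal exception on running job id=(\S+)")


def fatal_job_ids(log_text):
    """Return job IDs from 'raising fatal exception' lines, in order, without duplicates."""
    seen = []
    for line in log_text.splitlines():
        match = FATAL_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


print(fatal_job_ids(LOG))
```

The "exec aborted" lines are deliberately ignored so that each job is counted once, based on the scheduler's own message.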

Furthermore, new job submissions were failing validation:

$ flux alloc -N1 
flux-alloc: ERROR: Internal match error: Value of "type" must be a resource type known to fluxion

Downgrading to v0.42.0 had the same issue.
Downgrading to v0.41.0 resolved the matter.

milroy commented Feb 22, 2025

A possibly related observation: while debugging an unrelated issue, I noticed the 'Value of "type" must be a resource type known to fluxion' message in the logs for t3013-resource-unsat. In this case the cause is the zone resource in t/data/resource/jobspecs/satisfiability/test009.yaml:

version: 9999
resources:
  - type: zone
    count: 1
    with:
      - type: cluster
        count: 1
        with:
          - type: rack
            count: 1
            with:
              - type: node
                count: 1
                with:
                  - type: slot
                    count: 1
                    label: default
                    with:
                      - type: socket
                        count: 1
                        with:
                          - type: core
                            count: 1

Note that zone is not a defined resource_type_t.
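
To illustrate the kind of check that produces this message, here is a rough sketch that walks a jobspec's nested resources and reports any type not in an allow-list. It is not fluxion's implementation: `KNOWN_TYPES` is a hypothetical set built only from the non-zone types in the jobspec above and does not reflect the real contents of resource_type_t, and `unknown_types` is an illustrative helper name. The jobspec is written as a Python dict to avoid a YAML dependency.

```python
# Hypothetical allow-list for illustration only; NOT fluxion's actual resource_type_t set.
KNOWN_TYPES = {"cluster", "rack", "node", "slot", "socket", "core"}


def unknown_types(resources, known=KNOWN_TYPES):
    """Depth-first walk of a jobspec resource list; return types not in `known`."""
    found = []
    for res in resources:
        if res.get("type") not in known:
            found.append(res.get("type"))
        # Children live under the "with" key, as in the YAML above.
        found.extend(unknown_types(res.get("with", []), known))
    return found


# The resources section of test009.yaml quoted above, as a dict.
jobspec = {
    "version": 9999,
    "resources": [
        {"type": "zone", "count": 1, "with": [
            {"type": "cluster", "count": 1, "with": [
                {"type": "rack", "count": 1, "with": [
                    {"type": "node", "count": 1, "with": [
                        {"type": "slot", "count": 1, "label": "default", "with": [
                            {"type": "socket", "count": 1, "with": [
                                {"type": "core", "count": 1},
                            ]},
                        ]},
                    ]},
                ]},
            ]},
        ]},
    ],
}

print(unknown_types(jobspec["resources"]))
```

Under the assumed allow-list, only the top-level zone entry is flagged, which matches the observation that zone is the offending type in this test.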

milroy commented Feb 22, 2025

After further thought, I don't think the above observation is related to the incident on tioga. On the off chance it is, I'll leave the comment posted.
