slurmctld log shows "error: Node XXXXXXXXXX appears to have a different slurm.conf than the slurmctld." #267

gwolski · 2024-10-19T16:34:15Z

I have started a new cluster using aws-eda-slurm-cluster with parallelcluster 3.11.0 (though I suspect it happens with any version as I have logs from 3.9.1 that suggest this happened there).

When I submit jobs, I get error messages on HeadNode's /var/log/slurmctld.log:

[2024-10-19T06:22:38.009] error: Node od-r7a-2xl-dy-od-r7a-2xl-2 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

It seems the head node starts up and I suspect some aws-eda-slurm-cluster configuration is happening post head_node startup. This is only a guess.

Launch time of my HeadNode is 2024/10/17 14:08 GMT -7, yet slurm.conf and other files in /opt/slurm/etc show modification times of Oct 17 14:15, e.g.:

$ ls -l *.conf
-rw-r--r-- 1 root root 249 Oct 17 14:15 cgroup.conf
-rw-r--r-- 1 root root 174 Oct 17 14:15 gres.conf
-rw-r--r-- 1 root root 2136 Oct 17 14:15 slurm.conf
-rw-r--r-- 1 root root 177 Oct 17 14:15 slurm_parallelcluster_cgroup.conf
-rw-r--r-- 1 root root 3703 Oct 17 14:15 slurm_parallelcluster.conf
-rw-r--r-- 1 root root 3270 Oct 17 14:15 slurm_parallelcluster_gres.conf
-rw-r--r-- 1 root root 168 Oct 17 14:15 slurm_parallelcluster_slurmdbd.conf

I thought it might have to do with the files being modified to include the pathname of the new cluster name in the files, but even files that don't have the cluster name, e.g. slurm_parallelcluster_slurmdbd.conf show 14:15 timestamp.

Nonetheless, I make the error message go away with the command:

sudo scontrol reconfigure

on the HeadNode.

I'm reporting this here vs. parallelcluster issues as I don't want to believe this is prevalent in standard parallelcluster deployment.

Reproduce:
start a new cluster
start a new job on the new cluster
observe the /var/log/slurmctld.log on the HeadNode.

cartalla · 2024-10-22T17:24:41Z

I think that what happened here is that you started a job before the configuration of the cluster is complete. After ParallelCluster configures the head node it calls additional custom scripts which modify the configuration. During this time window Slurm is running even though configuration isn't complete. The CloudFormation stack shouldn't complete until after the configuration is complete so it would be helpful if you also included when the CloudFormation stack completed. I think that configuration changes should be complete at that point.

Did the jobs themselves fail or did you only see the error in slurmctld.log?

gwolski · 2024-10-22T19:44:16Z

I did not start jobs before the cluster was finished configuring. No jobs fail. I saw this error message continually in logs for days after the cluster was up.
In fact, I started a new cluster last night with your latest changes of yesterday. I just submitted a new job. Here is the slurmctld.log output:

[2024-10-22T12:26:42.994] sched: _slurm_rpc_allocate_resources JobId=5 NodeList=sp-r7a-m-dy-sp-8-gb-1-cores-1 usec=505
[2024-10-22T12:26:56.001] POWER: no more nodes to resume for job JobId=5
[2024-10-22T12:26:56.001] POWER: power_save: waking nodes sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-10-22T12:30:02.908] auth/jwt: auth_p_token_generate: created token for root for 1800 seconds
[2024-10-22T12:30:02.916] auth/jwt: auth_p_token_generate: created token for root for 1800 seconds
[2024-10-22T12:31:09.580] error: Node sp-r7a-m-dy-sp-8-gb-1-cores-1 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2024-10-22T12:31:09.580] Node sp-r7a-m-dy-sp-8-gb-1-cores-1 now responding
[2024-10-22T12:31:09.580] POWER: Node sp-r7a-m-dy-sp-8-gb-1-cores-1/sp-r7a-m-dy-sp-8-gb-1-cores-1/10.6.12.12 powered up with instance_id=, instance_type=
[2024-10-22T12:31:25.000] job_time_limit: Configuration for JobId=5 complete
[2024-10-22T12:31:25.000] Resetting JobId=5 start time for node power up
[2024-10-22T12:32:00.003] error: Node sp-m7a-m-dy-sp-4-gb-1-cores-1 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

I still believe config file changes after the head_node slurmctld daemon is started. From the /var/log/slurmctld.log file, the slurm.conf file is read :

[2024-10-21T18:42:55.513] read_slurm_conf: backup_controller not specified

But the date on the slurm.conf file is after this:
$ ls -l --full-time /opt/slurm/etc
total 52
-rw-r--r-- 1 root root 249 2024-10-21 18:43:32.101919815 -0700 cgroup.conf
-rw-r--r-- 1 root root 174 2024-10-21 18:43:31.654917609 -0700 gres.conf
drwxr-xr-x 3 root root 8192 2024-10-21 18:43:32.081919717 -0700 pcluster
drwxr-xr-x 4 root root 38 2024-10-21 18:42:46.890696266 -0700 scripts
-rw-r--r-- 1 root root 2136 2024-10-21 18:43:31.649917584 -0700 slurm.conf
-rwxr-xr-x 1 root root 233 2024-10-21 18:42:46.839696012 -0700 slurm.csh
-rw-r--r-- 1 root root 177 2024-10-21 18:43:32.092919771 -0700 slurm_parallelcluster_cgroup.conf
-rw-r--r-- 1 root root 4383 2024-10-21 18:43:32.083919726 -0700 slurm_parallelcluster.conf
-rw-r--r-- 1 root root 3908 2024-10-21 18:43:32.088919751 -0700 slurm_parallelcluster_gres.conf
-rw-r--r-- 1 root root 168 2024-10-21 18:43:32.096919791 -0700 slurm_parallelcluster_slurmdbd.conf
-rwxr-xr-x 1 root root 140 2024-10-21 18:42:46.836695997 -0700 slurm.sh

As noted, the messages go away after a scontrol reconfigure.

This might be a native parallelcluster issue - I have not started a cluster with just pcluster, I always use aws-eda-slurm-cluster tools.

As noted, no jobs fail, it's just a a disconcerting message.

cartalla · 2024-10-22T21:07:33Z

Noted. Let me see if I can reproduce.

I definitely update the config after ParallelCluster has slurmctld up and running and I restart slurmctld. I didn't think I needed to do an scontrol reconfig unless there were running compute nodes, but maybe I do.
I should probably do that anyway since there could be static nodes that need to see the new config too.

Would be nice if Slurm would figure this out itself.

cartalla · 2024-10-23T20:05:26Z

I just confirmed this with a new cluster.


[2024-10-23T17:11:54.389] sched: _slurm_rpc_allocate_resources JobId=1 NodeList=sp-m8g-m-dy-sp-4-gb-1-cores-1 usec=2536
[2024-10-23T17:12:00.000] POWER: no more nodes to resume for job JobId=1
[2024-10-23T17:12:00.000] POWER: power_save: waking nodes sp-m8g-m-dy-sp-4-gb-1-cores-1
[2024-10-23T17:14:00.000] POWER: Power save mode: 399 nodes
[2024-10-23T17:16:30.240] error: Node sp-m8g-m-dy-sp-4-gb-1-cores-1 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2024-10-23T17:16:30.240] Node sp-m8g-m-dy-sp-4-gb-1-cores-1 now responding

Need to check if slurm conf changed after the most recent restart of slurmctld.

cartalla · 2024-10-23T21:14:34Z

There was a bug in the ansible task that was updating the slurm.conf that wasn't correctly detecting changes and restarting slurmctld.

Require at least 4 GB or else instance doesn't have enough memory. Update Lambdas to Python 3.12 from 3.9. Fix bug in Xio configuration. Resolves #268 Fix bug in ansible task that updates slurm.conf that didn't correctly detect changes and restart slurmctld. Resolves #267

cartalla closed this as completed in ada1a31 Oct 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slurmctld log shows "error: Node XXXXXXXXXX appears to have a different slurm.conf than the slurmctld." #267

slurmctld log shows "error: Node XXXXXXXXXX appears to have a different slurm.conf than the slurmctld." #267

gwolski commented Oct 19, 2024 •

edited

Loading

cartalla commented Oct 22, 2024

gwolski commented Oct 22, 2024

cartalla commented Oct 22, 2024

cartalla commented Oct 23, 2024

cartalla commented Oct 23, 2024

slurmctld log shows "error: Node XXXXXXXXXX appears to have a different slurm.conf than the slurmctld." #267

slurmctld log shows "error: Node XXXXXXXXXX appears to have a different slurm.conf than the slurmctld." #267

Comments

gwolski commented Oct 19, 2024 • edited Loading

cartalla commented Oct 22, 2024

gwolski commented Oct 22, 2024

cartalla commented Oct 22, 2024

cartalla commented Oct 23, 2024

cartalla commented Oct 23, 2024

gwolski commented Oct 19, 2024 •

edited

Loading