Troubleshooting guide
👉 Note that the Galaxy training network has also published their own troubleshooting guide!
- Go to the Jobs admin page in Galaxy and check that the job dispatching is not locked under Job Control.
- Check the Slurm queue with the `squeue` command. If a job is listed as being in the "pending" state, it means that there are not enough cores or memory currently available to run the job, so it will just have to wait until other jobs have completed and enough resources have been released.
- Check the status of the compute nodes with `sinfo`. If a node is in the drained or down state, it will not start any new jobs. You can undrain a node with the command `sudo scontrol update nodename=XXX state=resume`, where XXX is the name of the node (e.g. "slurm.usegalaxy.no" or "ecc2.usegalaxy.no"). See the sketch below.
- There could be other problems with Slurm as well. Check out this troubleshooting guide.
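A minimal check sequence could look like this (the node name is only an example; use whatever `sinfo` reports):

```bash
# Read-only overview of the queue and the node states
squeue -l      # list all jobs with their state and reason
sinfo -N -l    # list every node with its current state

# Undrain a node that sinfo reports as "drained"
sudo scontrol update nodename=ecc2.usegalaxy.no state=resume
```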
- Check that the directory "/srv/galaxy/server/dependencies" has file rights `drwxr-xr-x` (755). If this is not the case, you will have to change it with `chmod 755 /srv/galaxy/server/dependencies`. (We also have a cron job that resets these file rights regularly if they are wrong.)
- If a tool has never been run before and has a unique set of dependency requirements, Galaxy will have to build a Singularity container with the required packages before running the tool. This will take time, and the job will be shown as queued until this step is finished. You can check the logs to see if the container is being built. If Galaxy is unable to build a Singularity container with the required dependencies, the job can be stuck forever in a queued state. This can, for instance, happen if the tool wrapper relies on old-style Tool Shed packages (defined in a `tool_dependencies.xml` file) rather than Conda packages. Examine the tool wrapper to see if this is the case.
- Sometimes a job needs to pre-process a dataset before running (for instance to uncompress the file or convert it to another format). This can potentially take a long time, and the job will not be shown as "running" until this step is finished. If this is the reason the job is still "queued", you just have to wait until it is done. (You should be able to see the hidden pre-processing job in the "Jobs" list on the Admin page.)
- If a tool has specified that it needs a lot of resources to run (e.g. CPU cores or memory), the job will be stuck in the queue until those resources are available. If no compute node is able to meet the requirements, the job will never start. You can force Slurm to run the job anyway by lowering the resources for that job with the command `scontrol update job=<JobID> numcpus=<CORES> minmemorynode=<MB>`.
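For example, to lower the requirements of a queued job to 4 cores and 8 GB of memory (the job ID and values are purely illustrative; the memory is given in MB):

```bash
scontrol update job=12345 numcpus=4 minmemorynode=8192
```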
- Note that the test server uses the same `tool_destinations.yaml` file as the production server to specify resource requirements, but the compute nodes on the test stack have fewer resources. This means that some tools that work on the production server will not start on the test server, because the nodes do not have enough memory/CPU. To run these tools on the test server, you can modify the `/srv/galaxy/dynamic_rules/usegalaxy/tool_destinations.yaml` file to lower the requirements of the tool, or run the `scontrol update` command mentioned above to adjust the resource requirements for a single job.
- Some jobs that are in a "new" state may be part of a workflow and are just waiting for previous steps to complete before they can start.
- If "new" jobs are still not being started and are not added to the Slurm queue (particularly "upload" jobs), it means that Galaxy has failed to pick up these new jobs for some reason. Try restarting the Galaxy server with
sudo systemctl restart galaxy
.
- Due to issues with our current CVMFS setup, jobs that use reference datasets may take a very long time to complete. Until these problems are fixed, we just have to be patient when running these jobs.
- Log in to the compute node where the jobs are running and check that the NFS-shared directories "/data/part0/" (which contains the datasets) and "/srv/galaxy/" are accessible. If they are not listed when you run `df`, you can try to remount these volumes with the command `sudo mount -a`. This should mount all the NFS shares listed in "/etc/fstab" (assuming they are indeed included there). If the computer appears to hang when you access these directories or run `df`, you can try to restart the NFS service on the main node with `sudo systemctl restart nfs`. If the directories are still not available, but they are available on another compute node, you should drain the problematic node and move the jobs onto another node with the requeue command explained below. Then contact another sysadmin to help you.
- You can check the tool's output to STDERR and STDOUT to see if it provides any hints as to why the job is taking so long. First find the Galaxy job ID of the job (not the job ID of the Slurm job), then run the command `sudo find /data/part0/tmp/jobs/ -type d -name <jobID>` to find the job's working directory. Go to the "outputs" subdirectory and look at the files named "tool_stdout" and "tool_stderr".
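Put together, this could look something like the following sketch (the job ID is illustrative and assumes the job's working directory still exists):

```bash
# Find the working directory for Galaxy job 4711 and inspect the tool output
JOBDIR=$(sudo find /data/part0/tmp/jobs/ -type d -name 4711)
sudo cat "$JOBDIR/outputs/tool_stdout" "$JOBDIR/outputs/tool_stderr"
```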
- You can try to requeue a running job with the command `sudo scontrol requeue <job_id>`. This should put the job back in the pending state, and it will be rerun when resources become available (which may be on the same node or a different node). If the node the job is currently running on seems slow, you should force the requeued job onto a different node by draining the current node first. Note that after a job has been requeued, it may end up in a completing (CG) state and appear to be stuck there for a long time, but this should hopefully resolve itself after a while, and the job will be assigned a new BeginTime, which may be in the future (see `scontrol show job <jobID>`).
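A possible sequence for moving a job off a slow node (the node name and job ID are illustrative):

```bash
# Drain the slow node first so the requeued job cannot be scheduled there again
sudo scontrol update nodename=ecc2.usegalaxy.no state=drain reason=slow_node
sudo scontrol requeue 12345
scontrol show job 12345   # check the new state and BeginTime
```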
- Check that all the Slurm services are running correctly with `systemctl status <service>`. This includes the Slurm Controller (slurmctld) and Slurm Database (slurmdbd) on the main node, and the Slurm Daemon (slurmd) service on every compute node. Restart a service with `sudo systemctl restart <service>`, if needed.
- If a job has been running suspiciously long on Slurm without completing, inspect the jobs via the Galaxy Admin page to check that the job in question is still in a "running" state in Galaxy. (If the job was started long ago, you may have to check the jobs via the reports server instead, since the Admin page may not list all jobs even if the cutoff time is set to a high value.) If the job is listed as being in an "OK", "Error" or "Deleted" state instead of "Running", you should cancel the Slurm job manually with `sudo scancel <jobID>`.
- If something terrible has happened to Slurm services during job execution, you may end up with runaway jobs. You can check this with the command `sudo sacctmgr show runaway` (which also gives you the option of fixing these jobs).
- If a job is part of a workflow and an upstream step fails with an error, the jobs coming after it will be placed in a PAUSED state.
- Jobs can also be paused if the user has exceeded their assigned disk quota. In this case, the user must first free up space by deleting datasets they don't need anymore, and then select "Resume Paused Jobs" from the history options menu (cogwheel) to start them up.
- If resumed jobs are still not being started by Galaxy (even if compute resources are available), you may have to restart the Galaxy server with `sudo systemctl restart galaxy`.
- First make absolutely sure that the job/process is not running at all, and that it's not just taking a long time to complete.
- If no other jobs are running on the same compute node, you should take down the node with `sudo scontrol update nodename=XXX state=down reason=hang_proc` and then bring it back up again with `sudo scontrol update nodename=XXX state=resume`. If there are other jobs still running on the same node, you should drain the node first and wait for these jobs to complete before you take down the node. [Ref: Slurm Troubleshooting guide]
- You can (temporarily) stop a running job with `sudo scontrol requeuehold <jobid>`. It will be placed back in a pending state, so it can be run later instead, but not until you explicitly release it from its hold with `sudo scontrol release <jobid>`.
- If a job has not started yet because it is waiting for resources to become available, you can prevent it from starting with `sudo scontrol hold <jobid>`. It will not start until you release it.
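For example (the job ID is illustrative):

```bash
sudo scontrol requeuehold 12345   # stop the running job and keep it on hold
# ...later, when it should be allowed to run again:
sudo scontrol release 12345
```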
If you run the `sinfo` command and a node has a star behind its state, e.g. `idle*`, it means that the node is not responding and will not be assigned jobs.
- Try `ping`ing the node to see if it responds, or log in to the node with SSH. If that does not work, there is something wrong with the node itself or the network. If the node can be accessed normally, there is something wrong with Slurm or the Slurm configuration.
- Log into the compute node and make sure the Slurm Daemon is running with `systemctl status slurmd`. If this service is not active, start it with `sudo systemctl start slurmd`. A restart of the service might also be beneficial sometimes. (Will that affect running jobs?)
- Restarting the Slurm Controller (or even the Slurm Database) service on the main node (`usegalaxy.no`) can sometimes help: `sudo systemctl restart slurmctld` (and `sudo systemctl restart slurmdbd`). This should not affect running jobs.
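A quick diagnostic sequence could be (the node name is illustrative):

```bash
# Is the node reachable at all?
ping -c 3 ecc2.usegalaxy.no

# Is the Slurm daemon running on the node?
ssh ecc2.usegalaxy.no systemctl status slurmd

# On the main node: restart the controller if the node itself looks healthy
sudo systemctl restart slurmctld
```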
- Look into the logs to see if they can point you in the right direction. The logs can be found in `/var/log/slurm/`. Check both on the main node and the compute node.
- Make sure the IP address for the compute node is correct in `/etc/slurm/slurm.conf`.
- Make sure that both the main node, which is running the Slurm controller (`usegalaxy.no`), and the compute node have identical copies of the `/etc/slurm/slurm.conf` file. If these differ, it could cause problems. (Note: on all nodes, this file should now have been replaced with a symlink pointing to `/srv/galaxy/slurm/slurm.conf` to ensure all nodes use the same configuration. See the check below.)
- Have a look at the Slurm troubleshooting guide.
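One way to check this is to run the following on both the main node and the compute node and compare the output:

```bash
ls -l /etc/slurm/slurm.conf    # should be a symlink to /srv/galaxy/slurm/slurm.conf
md5sum /etc/slurm/slurm.conf   # the checksum should be identical on all nodes
```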
- If the job goes directly from a queued state to an error state without "running" in between, or if no informative error message is given, there could be something wrong with the compute node the job was assigned to. (You can find out which node this was by looking at the "hostname" under "Job Metrics" on the "View Details" information page for the dataset.) Try to drain that node to force all new jobs to run on other nodes instead: `sudo scontrol update nodename=XXX state=drain reason=issues`, where XXX is the name of the node (e.g. "slurm.usegalaxy.no" or "ecc2.usegalaxy.no").
- If a tool starts complaining about missing dependencies, there could be something wrong with its Singularity container that was not detected before, because the tool had never been run with the specific combination of arguments that triggered this behaviour. In this case you may have to rebuild the container image and manually include the necessary packages.
- If several different tools suddenly start complaining about missing dependencies, there could be something wrong with container resolution in Galaxy, which leads to all tools being run with the default container `/srv/galaxy/containers/galaxy-python.sif`. (Tip: if the work directories for failed jobs have been kept (configured in "/srv/galaxy/config/galaxy.yml"), you can find the job execution script by running the command `sudo find /data/part0/tmp/jobs/ -name galaxy_<JobID>.sh` (where `<JobID>` is a number). The container that the job was run with can be found on the line containing `singularity -s exec`, right after the `-H /srv/galaxy/` argument.) One reason container resolution might fail is if the directory `/srv/galaxy/containers/singularity/mulled/` contains files with unexpected names, as was the case with issue #82.
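As a sketch, locating the script and the container line for a failed job could look like this (the job ID is illustrative):

```bash
# Find the execution script for Galaxy job 4711 and see which container it actually used
SCRIPT=$(sudo find /data/part0/tmp/jobs/ -name galaxy_4711.sh)
sudo grep "singularity -s exec" "$SCRIPT"
```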
- If a tool fails with the error "Failed to create user namespace: user namespace disabled", it is probably an issue with Singularity/Apptainer on the node where the tool was running. This should be solvable by installing the "apptainer-suid" package on that node with `sudo yum install apptainer-suid` (see issue #80).
- If a tool fails with the error message "Unable to run this job due to a cluster error, please retry it later", it could be because the local disk is full, especially if the `/var/` partition has been filled up with log files in `/var/log/`. Check with `df -h` and delete old log files if necessary (see the sketch below). (You should probably check disk usage on both the main node and the compute nodes.)
- If a tool seems unable to access the files it needs from CVMFS, see the section below.
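A quick way to spot a full disk and the biggest log files (run on both the main node and the compute nodes):

```bash
df -h                                     # look for partitions at or near 100% use
sudo du -sh /var/log/* | sort -h | tail   # the largest entries under /var/log
```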
- If the tool fails with the error `tool_script.sh: XXX: not found`, it means something is wrong with the dependencies. Check the `<requirements>` section in the tool wrapper.
- If the tool fails directly after starting with the error "This tool was disabled before the job completed. Please contact your Galaxy administrator.", it could be solved by simply restarting the Galaxy server (`sudo systemctl restart galaxy`).
- If you immediately get a red error message in the middle panel instead of the expected tool parameters form when you click on the tool in the Tool Panel, it might perhaps be solved by restarting the Galaxy server (`sudo systemctl restart galaxy`).
- If the tool starts but fails with an error that looks something like this: `FATAL: Unable to handle docker://.... uri: while building SIF from layers: ...: no space left on device`, it means that the Singularity container creation step has failed because there was not enough disk space. You can try building the container manually while pointing the environment variables `SINGULARITY_CACHEDIR` and `SINGULARITY_TMPDIR` (or `APPTAINER_CACHEDIR` and `APPTAINER_TMPDIR`) to a location with more space (see the sketch below). If the problem reoccurs many times, we should consider changing these permanently.
- If a newly installed tool does not appear in the Tool Panel at all, check that the profile attribute in the tool wrapper does not specify a more recent Galaxy version than the one currently available on UseGalaxy.no. If it does, the Galaxy server will simply ignore that tool.
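A manual build could look something like the sketch below; the cache/temp directories and the image URI are placeholders and should be adjusted to the tool in question:

```bash
# Put the Singularity/Apptainer cache and temporary files on a volume with more space
export SINGULARITY_CACHEDIR=/data/part0/tmp/singularity-cache
export SINGULARITY_TMPDIR=/data/part0/tmp/singularity-tmp
mkdir -p "$SINGULARITY_CACHEDIR" "$SINGULARITY_TMPDIR"

# Build the image for the tool (replace <tool> and <tag> with the actual container)
singularity build <tool>.sif docker://quay.io/biocontainers/<tool>:<tag>
```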
- If a tool is unable to download required files from the internet, check that the `/etc/resolv.conf` file (which lists available DNS servers) is properly mounted inside the container by running `singularity exec <imagefile> cat /etc/resolv.conf` or `singularity exec <imagefile> ls -l /etc/resolv.conf`. If this file does not exist inside the container, you will have to build a new Singularity image for the tool.
- For other problems, consult our tool troubleshooting flowchart.
- Log into the node that the tool was running on and try to list or access the files in `/cvmfs/data.galaxyproject.org` or `/cvmfs/data.usegalaxy.no`.
- If that doesn't work, try probing the repositories with `sudo cvmfs_config probe data.galaxyproject.org` or `sudo cvmfs_config probe data.usegalaxy.no`.
- If that doesn't work either, try to restart "autofs" with `sudo systemctl restart autofs`.
- And if that also fails, check that the disks are not full with `df -h`. If the `/var` partition (or the root partition `/`) is full, it can cause trouble for CVMFS. Delete some log files in `/var/log/` to free up space.
- If the output from the `df -h` command contains lines like this: `df: ‘/cvmfs/data.usegalaxy.no’: Transport endpoint is not connected`, you can try to manually unmount the CVMFS file systems with `sudo fusermount -uz /cvmfs/data.galaxyproject.org; sudo fusermount -uz /cvmfs/data.usegalaxy.no` and then reload the CVMFS configuration with `sudo cvmfs_config reload`. Check with `ls /cvmfs/data.galaxyproject.org` afterwards (see the sketch below).
- CVMFS also has its own [troubleshooting page](https://cvmfs.readthedocs.io/en/stable/cpt-quickstart.html#troubleshooting).
- You can configure the amount of memory and cores each tool is allotted by following this documentation.
- If you are using the old NGA-based operations: check that both the NGA Master and the two NGA Runner processes are running with `systemctl list-units --type service | grep nga`. If some of these are not running, restart them with `sudo systemctl restart nga-master`, `sudo systemctl restart nga-runner@1` and `sudo systemctl restart nga-runner@2`, respectively.
- There could be problems loading the static resources (JavaScript and CSS) that are needed to dynamically generate the page contents after it has been loaded. Check the network log in your browser to see if some files return HTTP errors.
- One of the disk partitions on the server may be full (check with `df -h`).
- Check the logs for problems with `sudo journalctl -u galaxy`. Add the `-e` flag to see the last messages (using `-n <N>` to control how many lines to see) or `-f` to follow while the log is being updated.
- If the log complains about `No usable temporary directory found in ['/tmp', '/var/tmp' ...]`, the disk may be full. Check with `df -h` and remove unnecessary files to free up space (especially log files in `/var/log` and other temporary files).
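For example:

```bash
sudo journalctl -u galaxy -e -n 200   # jump to the end and show the last 200 lines
sudo journalctl -u galaxy -f          # follow new log messages as they arrive
```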
- ... solution coming ...
Problem: The Galaxy Reports server just shows a blank page when you click on a Job ID number to show more information about a job
- This will typically happen after upgrading Galaxy. A simple restart of the Reports server with `sudo systemctl restart galaxy-reports` should fix it.
- There are several layers of firewalls in NREC. In addition to configuring the firewall on the compute node itself (which is done by our Ansible playbook), you must also add the node to a security group that allows internet access via the NREC administration interface.
- The "usegalaxy" module is something that we have made ourselves, and it contains our dynamic job dispatching rules that controls where and how various jobs are executed. It can be found in the directory "/srv/galaxy/dynamic_rules/usegalaxy". Make sure that the PYTHONPATH environment variable for the job handler processes includes the parent directory "/srv/galaxy/dynamic_rules".
Problem: The compute backend status page suddenly reports zero job executions in the last week, even though I know that some jobs have been run
I also get an error message when running the command "list_jobs" on the main node (this is an alias for `sacct -u galaxy -X --format=JobID,JobName%30,Start,End,Elapsed,AllocCPUS,ReqMem,State,NodeList%20`).
- This is usually due to a problem with the connection between the SlurmDB daemon running on the main node and the MySQL/MariaDB database that stores the information on the database node (db.[test.]usegalaxy.no). Look in the logs located in "/var/log/slurm/slurmdbd.log". Sometimes the connection can be restored by running `sudo mysqladmin flush-hosts` on the DB node, followed by a restart of SlurmDBD on the main node with `sudo systemctl restart slurmdbd`.
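A typical recovery sequence (run the first command on the DB node and the rest on the main node):

```bash
# On the database node: clear hosts that MariaDB/MySQL has blocked
sudo mysqladmin flush-hosts

# On the main node: restart the Slurm database daemon and verify that it reconnects
sudo systemctl restart slurmdbd
systemctl status slurmdbd
sudo tail -n 50 /var/log/slurm/slurmdbd.log
```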