Troubleshooting guide

Kjetil Klepper edited this page Oct 10, 2024 · 110 revisions

👉 Note that the Galaxy training network has also published their own troubleshooting guide!

Problem: Jobs are queued but not starting (stuck in "queued" or "new" states)

  • Go to the Jobs admin page in Galaxy and check that the job dispatching is not locked under Job Control.
  • Check the Slurm queue with the squeue command. If a job is listed as being in the "pending" state, it means that there are not enough cores or memory currently available to run the job. It will have to wait until other jobs have completed and enough resources have been released.
  • Check the status of the compute nodes with sinfo. If a node is in the drained or down states, it will not start any new jobs. You can undrain a node with the command: sudo scontrol update nodename=XXX state=resume, where XXX is the name of the node (e.g. "slurm.usegalaxy.no" or "ecc2.usegalaxy.no").
  • There could be other problems with Slurm as well. Check out this troubleshooting guide
  • Check that the directory "/srv/galaxy/server/dependencies" has file rights drwxr-xr-x (755). If this is not the case, you will have to change it with chmod 755 /srv/galaxy/server/dependencies. (We also have a cron job which will reset these file rights regularly if they are wrong.)
  • If a tool has never been run before and has a unique set of dependency requirements, Galaxy will have to build a Singularity container with the required packages before running the tool. This will take time and the job will be shown as queued until this step is finished. You can check the logs to see if the container is being built. If Galaxy is unable to build a Singularity container with the required dependencies, the job can be stuck forever in a queued state. This can for instance happen if the tool wrapper relies on old style tool shed packages (defined in a tool_dependencies.xml file) rather than Conda packages. Examine the tool wrapper to see if this is the case.
  • Sometimes a job needs to pre-process a dataset before running (for instance to uncompress the file or convert it to another format). This can potentially take a long time and the job will not be shown as "running" until this step is finished. If this is the reason the job is still "queued", you just have to wait until it is done. (You should be able to see the hidden pre-processing job in the "Jobs" list on the Admin page.)
  • If a tool has specified that it needs a lot of resources to run (e.g. CPU cores or memory), the job will be stuck in the queue until those resources are available. If no compute node is able to meet the requirements, the job will never start. You can force Slurm to run the job anyway by lowering the resources for that job with the command: scontrol update job=<JobID> numcpus=<CORES> minmemorynode=<MB>.
  • Note that the test server uses the same tool_destinations.yaml file as the production server to specify resource requirements, but the compute nodes on the test stack have fewer resources. This means that some tools that work on the production server will not start on the test server, because the nodes do not have enough memory/CPU. To run these tools on the test server, you can modify the /srv/galaxy/dynamic_rules/usegalaxy/tool_destinations.yaml file to lower the requirements of the tool or run the scontrol update command mentioned above to adjust the resource requirements for a single job.
  • Some jobs that are in a "new" state may be part of a workflow and are just waiting for previous steps to complete before they can start.
  • If "new" jobs are still not being started and are not added to the Slurm queue (particularly "upload" jobs), it means that Galaxy has failed to pick up these new jobs for some reason. Try restarting the Galaxy server with sudo systemctl restart galaxy.
  • If upload jobs (and some other minor tasks) are stuck in a queued state but other tool jobs are running correctly, there may be problems with the Celery task worker. See the section below for solutions.
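
The node check described above can be scripted. The following is only a minimal sketch and not part of our setup: the function name blocked_nodes is made up for illustration, and it assumes that sinfo is asked to print one "node state" pair per line.

```shell
# Hypothetical helper: read "<node> <state>" pairs on stdin and print the
# nodes whose state starts with "down" or "drain" (drained, draining, ...).
blocked_nodes() {
    awk '$2 ~ /^(down|drain)/ { print $1 " is " $2 }'
}

# Intended use on the main node (not executed here):
#   sinfo -h -N -o '%N %T' | blocked_nodes
```

Nodes listed by this helper will not start new jobs until they are resumed with the scontrol update command shown in the list above. A node that belongs to several partitions may be printed more than once.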

Problem: Jobs handled by Celery are not running

  • Take a look at the Celery logs with sudo journalctl -u galaxy-celery -e to identify the problem. You can also try to restart Celery with sudo systemctl restart galaxy-celery, if needed.
  • If Celery complains about certificate issues when connecting to RabbitMQ on the database-server, the problem could be that the "rabbitmq" user does not have read access to updated certificate files that have been recently installed by Certbot ("Cannot connect to amqp://galaxy:**@db.usegalaxy.no:5671/galaxy_internal: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed"). Log in to the database server and execute the command sudo setfacl -m u:rabbitmq:r /etc/letsencrypt/live/db.usegalaxy.no/xxx.pem for each file "xxx" in the directory. Restarting the RabbitMQ server with sudo systemctl restart rabbitmq-server may also help.
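
Since the setfacl command has to be repeated for every certificate file, it can help to generate the commands in a loop. This is a sketch under the assumption that all relevant files end in ".pem"; the function name print_acl_commands is hypothetical, and it only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical helper: print one setfacl command per .pem file in a directory,
# so the commands can be reviewed before they are run.
print_acl_commands() {
    acl_user=$1
    cert_dir=$2
    for pem in "$cert_dir"/*.pem; do
        [ -e "$pem" ] || continue   # skip the unexpanded pattern if no files match
        printf 'sudo setfacl -m u:%s:r %s\n' "$acl_user" "$pem"
    done
}

# Intended use on the database server (not executed here):
#   print_acl_commands rabbitmq /etc/letsencrypt/live/db.usegalaxy.no
```

The helper must be run from a shell that is allowed to list the certificate directory, otherwise the file pattern matches nothing and no commands are printed.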

Problem: Jobs are running endlessly and not completing

  • Due to issues with our current CVMFS setup, jobs that use reference datasets may take a very long time to complete. Until these problems are fixed, we just have to be patient when running these jobs.
  • Log in to the compute node where the jobs are running and check that the NFS-shared directories "/data/part0/" (which contains the datasets) and "/srv/galaxy/" are accessible. If they are not listed when you run df, you can try to remount these volumes with the command sudo mount -a. This should mount all the NFS shares listed in "/etc/fstab" (assuming they are indeed included there). If the computer appears to hang when you access these directories or run df, you can try to restart the NFS service on the main node with sudo systemctl restart nfs. If the directories are still not available, but they are available on another compute node, you should drain the problematic node and move the jobs onto another node with the requeue command explained below. Then contact another sysadmin to help you.
  • You can check the tool's output to STDERR and STDOUT to see if they provide any hints as to why the job is taking so long. First find the Galaxy jobID of the job (not the jobID of the Slurm job), then run the command sudo find /data/part0/tmp/jobs/ -type d -name <jobID> to find the job's working directory. Go to the "outputs" subdirectory and look at the files named "tool_stdout" and "tool_stderr".
  • You can try to requeue a running job with the command sudo scontrol requeue <job_id>. This should put the job back in the pending state, and it will be rerun when resources become available (which may be on the same node or a different node). If the node the job is currently running on seems slow, you should force the requeued jobs onto a different node by draining the current node first. Note that after a job has been requeued, it may end up in a completing (CG) state and appear to be stuck there for a long time. This should resolve itself after a while, and the job will be assigned a new BeginTime (which may be in the future; see scontrol show job <jobID>).
  • Check that all the Slurm services are running correctly with systemctl status <service>. This includes the Slurm Controller (slurmctld) and Slurm Database (slurmdbd) on the main node, and the Slurm Daemon (slurmd) service on every compute node. Restart a service with sudo systemctl restart <service>, if needed.
  • If a job has been running suspiciously long on Slurm without completing, inspect the jobs via the Galaxy Admin page to check that the job in question is still in a "running" state in Galaxy. (If the job was started long ago, you may have to check the jobs via the reports server instead, since the Admin page may not list all jobs even if the cutoff time is set to a high value). If the job is listed as being in an "OK", "Error" or "Deleted" state instead of "Running", you should cancel the Slurm job manually with sudo scancel <jobID>.
  • If something terrible has happened to Slurm services during job execution, you may end up with runaway jobs. You can check this with the command sudo sacctmgr show runaway (which also gives you the option of fixing these jobs).
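
Looking up a job's working directory and reading the tool output can be combined into one step. The sketch below is an illustration only: the function name show_job_output is hypothetical, and it assumes the output files are named "tool_stdout" and "tool_stderr" inside the "outputs" subdirectory.

```shell
# Hypothetical helper: locate a Galaxy job's working directory below a jobs
# root and print the last lines of the tool's STDOUT and STDERR files.
show_job_output() {
    jobs_root=$1
    job_id=$2
    job_dir=$(find "$jobs_root" -type d -name "$job_id" | head -n 1)
    if [ -z "$job_dir" ]; then
        echo "No working directory found for job $job_id" >&2
        return 1
    fi
    for name in tool_stdout tool_stderr; do
        echo "== $job_dir/outputs/$name =="
        tail -n 20 "$job_dir/outputs/$name" 2>/dev/null || echo "(file not found)"
    done
}

# Intended use (not executed here; a root shell may be needed to read the files):
#   show_job_output /data/part0/tmp/jobs <jobID>
```

Remember that <jobID> is the Galaxy job ID, not the Slurm job ID.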

Problem: Jobs are stuck in a PAUSED state in Galaxy

  • If a job is part of a workflow and an upstream step fails with an error, the jobs coming after it will be placed in a PAUSED state.
  • Jobs can also be paused if the user has exceeded their assigned disk quota. In this case, the user must first free up space by deleting datasets they don't need anymore, and then select "Resume Paused Jobs" from the history options menu (cogwheel) to start them up.
  • If resumed jobs are still not being started by Galaxy (even if compute resources are available), you may have to restart the Galaxy server with sudo systemctl restart galaxy.

Problem: Jobs are stuck in a COMPLETING state (CG) in the Slurm queue

  • First make absolutely sure that the job/process is not running at all, and that it's not just taking a long time to complete.
  • If no other jobs are running on the same compute node, you should take down the node with sudo scontrol update nodename=XXX state=down reason=hang_proc and then bring it back up again with sudo scontrol update nodename=XXX state=resume. If there are other jobs still running on the same node, you should drain the node first and wait for these jobs to complete before you take down the node. [Ref: Slurm Troubleshooting guide]

Problem: A user has submitted a lot of jobs and is hogging the entire compute cluster

  • You can (temporarily) stop a running job with sudo scontrol requeuehold <jobid>. It will be placed back in a pending state so that it can be run later, but not until you explicitly release it from its hold with sudo scontrol release <jobid>.
  • If a job has not started yet because it is waiting for resources to become available, you can prevent it from starting with sudo scontrol hold <jobid>. It will not start until you release it.

Problem: One of the compute nodes is not responding

If you run the sinfo command and a node has a star behind its state, e.g. idle*, it means that the node is not responding and will not be assigned jobs.

  • Try pinging the node to see if it responds, or log in to the node with SSH. If that does not work, there is something wrong with the node itself or the network. If the node can be accessed normally, there is something wrong with Slurm or the Slurm configuration.
  • Log into the compute node and make sure the Slurm Daemon is running with systemctl status slurmd. If this service is not active, start it with sudo systemctl start slurmd. A restart of the service might also be beneficial sometimes. (Will that affect running jobs?)
  • Restarting the Slurm Controller (or even the Slurm Database) service on the main node (usegalaxy.no) can sometimes help: sudo systemctl restart slurmctld (and sudo systemctl restart slurmdbd). This should not affect running jobs.
  • Look into the logs to see if that can point you in the right direction. The logs can be found in /var/log/slurm/. Check both on the main node and the compute node.
  • Make sure the IP address for the compute node is correct in /etc/slurm/slurm.conf.
  • Make sure that both the main node, which is running the Slurm controller (usegalaxy.no), and the compute node have identical copies of the /etc/slurm/slurm.conf file. If these differ it could cause problems. (Note: on all nodes, this file should now have been replaced with a symlink pointing to /srv/galaxy/slurm/slurm.conf to ensure all nodes use the same configuration.)
  • Have a look at the Slurm troubleshooting guide.
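
To verify that a node really uses the shared Slurm configuration, you can check that its slurm.conf is a symlink to the shared file. This is a minimal sketch, assuming a readlink that supports the -f option (as in GNU coreutils); the function name conf_points_to is hypothetical.

```shell
# Hypothetical helper: succeed only if the given path is a symlink that
# resolves to the same file as the expected target.
conf_points_to() {
    conf=$1
    expected=$2
    [ -L "$conf" ] && [ "$(readlink -f "$conf")" = "$(readlink -f "$expected")" ]
}

# Intended use on each node (not executed here):
#   conf_points_to /etc/slurm/slurm.conf /srv/galaxy/slurm/slurm.conf \
#       || echo "slurm.conf is not a symlink to the shared copy"
```

If the check fails on a node, that node may be running with a stale local copy of the configuration.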

Problem: A tool that has worked previously suddenly does not work anymore

  • If the job goes directly from a queued state to an error state without "running" in between, or if no informative error message is given, there could be something wrong with the compute node the job was assigned to. (You can find which node this was by looking at the "hostname" under "Jobs Metrics" on the "View Details" information page for the dataset.) Try to drain that node to force all new jobs to run on other nodes instead: sudo scontrol update nodename=XXX state=drain reason=issues, where XXX is the name of the node (e.g. "slurm.usegalaxy.no" or "ecc2.usegalaxy.no").
  • If a tool starts complaining about missing dependencies, there could be something wrong with its Singularity container that was not detected before, because the tool had never been run with the specific combination of arguments that triggered this behaviour. In this case you may have to rebuild the container image and manually include the necessary packages.
  • If several different tools suddenly start complaining about missing dependencies, there could be something wrong with container resolution in Galaxy, which leads to all tools being run with the default container /srv/galaxy/containers/galaxy-python.sif. (Tip: If the work directories for failed jobs have been kept (configured in "/srv/galaxy/config/galaxy.yml"), you can find the job execution script by running the command sudo find /data/part0/tmp/jobs/ -name galaxy_<JobID>.sh (where <JobID> is a number). The container that the job was run with can be found on the line containing singularity -s exec right after the -H /srv/galaxy/ argument.) One reason container resolution might fail is if the directory /srv/galaxy/containers/singularity/mulled/ contains files with unexpected names, as was the case with issue #82.
  • If a tool fails with the error "Failed to create user namespace: user namespace disabled", it is probably an issue with Singularity/Apptainer on the node where the tool was running. This should be solvable by installing the "apptainer-suid" package on that node with sudo yum install apptainer-suid (see issue #80).
  • If a tool fails with the error message "Unable to run this job due to a cluster error, please retry it later", it could be because the local disk is full; especially if the /var/-partition has been filled up with log-files in /var/log/. Check with df -h and delete old log files if necessary. (You should probably check disk usage on both the main node and the compute nodes.)
  • If a tool seems unable to access the files it needs from CVMFS, see the section below.
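
The tip above about finding the container in a job execution script can be sketched as a small helper. It is based purely on the description in the tip and assumes that the container path is the word directly following the value of the -H argument on the singularity -s exec line; the function name job_container is hypothetical, and the exact layout of the script line may differ between Galaxy versions.

```shell
# Hypothetical helper: print the word that follows "-H <home>" on the
# "singularity -s exec" line of a Galaxy job script.
job_container() {
    grep 'singularity -s exec' "$1" |
        awk '{ for (i = 1; i + 2 <= NF; i++) if ($i == "-H") { print $(i + 2); break } }'
}

# Intended use (not executed here):
#   job_container "$(sudo find /data/part0/tmp/jobs/ -name 'galaxy_<JobID>.sh')"
```

If the printed path is /srv/galaxy/containers/galaxy-python.sif for a tool that should have its own container, container resolution has probably failed.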

Problem: A new tool is not working

  • If the tool fails with the error tool_script.sh: XXX: not found, it means something is wrong with the dependencies. Check the <requirements> section in the tool wrapper.
  • If the tool is failing directly after starting with the error This tool was disabled before the job completed. Please contact your Galaxy administrator., it could be solved by simply restarting the Galaxy server (sudo systemctl restart galaxy).
  • If you immediately get a red error message in the middle panel instead of the expected tool parameters form when you click on the tool in the Tool Panel, it might perhaps be solved by restarting the Galaxy server (sudo systemctl restart galaxy).
  • If the tool starts but fails with an error that looks something like this: FATAL: Unable to handle docker://.... uri: while building SIF from layers: ...: no space left on device, it means that the Singularity container creation step has failed because there was not enough disk space. You can try building the container manually while pointing the environment variables SINGULARITY_CACHEDIR and SINGULARITY_TMPDIR (or APPTAINER_CACHEDIR and APPTAINER_TMPDIR) to a location with more space. If the problem reoccurs many times, we should consider changing these permanently.
  • If a newly installed tool does not appear in the Tools Panel at all, check that the profile attribute in the tool wrapper does not specify a more recent Galaxy version than the one currently available on UseGalaxy.no. If it does, the Galaxy server will simply ignore that tool.
  • If a tool is unable to download required files from the internet, check that the /etc/resolv.conf file (which lists available DNS servers) is properly mounted inside the container by running singularity exec <imagefile> cat /etc/resolv.conf or singularity exec <imagefile> ls -l /etc/resolv.conf. If this file does not exist inside the container, you will have to build a new Singularity image for the tool.
  • For other problems, consult our tool troubleshooting flowchart
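
For the "no space left on device" case above, the cache and temporary directories can be redirected for a single command without changing anything permanently. This is a minimal sketch: the function name with_build_dirs and the "cache" and "tmp" subdirectory names are made up for illustration, and the base directory is whatever location you know has enough free space.

```shell
# Hypothetical helper: run a command with the Singularity/Apptainer cache and
# temporary directories pointing to subdirectories of the given base directory.
with_build_dirs() {
    base=$1
    shift
    mkdir -p "$base/cache" "$base/tmp" || return 1
    SINGULARITY_CACHEDIR="$base/cache" SINGULARITY_TMPDIR="$base/tmp" \
    APPTAINER_CACHEDIR="$base/cache" APPTAINER_TMPDIR="$base/tmp" \
        "$@"
}

# Intended use (not executed here; <dir>, <imagefile> and <uri> are placeholders):
#   with_build_dirs <dir> singularity build <imagefile> docker://<uri>
```

Note that sudo normally does not pass these variables on to the command it runs, so the build command should be started from a shell that already has the required privileges.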

Problem: Tools are not able to access files in CVMFS

  • Log into the node that the tool was running on and try to list or access the files in /cvmfs/data.galaxyproject.org or /cvmfs/data.usegalaxy.no.
  • If that doesn't work, try probing the repositories with sudo cvmfs_config probe data.galaxyproject.org or sudo cvmfs_config probe data.usegalaxy.no.
  • If that doesn't work either, try to restart "autofs" with sudo systemctl restart autofs.
  • And if that also fails, check that the disks are not full with df -h. If the /var-partition (or root-partition /) is full, it can cause trouble for CVMFS. Delete some log files in /var/log/ to free up space.
  • If the output from the df -h command contains lines like this: df: ‘/cvmfs/data.usegalaxy.no’: Transport endpoint is not connected, you can try to manually unmount the CVMFS file systems with sudo fusermount -uz /cvmfs/data.galaxyproject.org; sudo fusermount -uz /cvmfs/data.usegalaxy.no and then reload the CVMFS configuration with sudo cvmfs_config reload. Check with ls /cvmfs/data.galaxyproject.org afterwards.
  • Try wiping the CVMFS configuration cache with sudo cvmfs_config wipecache
  • Try killing all the CVMFS processes with sudo cvmfs_config killall
  • Check that the Squid proxies can be reached from the node. E.g. curl http://cvmfsproxy01.usegalaxy.no:3128 should return an HTML error page from the Squid proxy rather than just freeze. If it freezes, try disabling the proxy by setting CVMFS_HTTP_PROXY=DIRECT in the file "/etc/cvmfs/default.local" and reload the configuration (e.g. with sudo cvmfs_config reload). If that works, then there is something wrong with the Squid proxy or the connection to it.
  • The official CVMFS documentation has its own troubleshooting page
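
The "Transport endpoint is not connected" check above can be automated by filtering the output of df. This is a minimal sketch, assuming a grep that supports the -o option; the function name stale_cvmfs_mounts is hypothetical.

```shell
# Hypothetical helper: read df output (including error messages) on stdin and
# print the /cvmfs mount points reported as "Transport endpoint is not connected".
stale_cvmfs_mounts() {
    grep 'Transport endpoint is not connected' |
        grep -o '/cvmfs/[A-Za-z0-9._-]*' |
        sort -u
}

# Intended use on the affected node (not executed here):
#   df -h 2>&1 | stale_cvmfs_mounts
```

Each mount point printed by the helper is a candidate for sudo fusermount -uz followed by sudo cvmfs_config reload, as described in the list above.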

Problem: A tool needs more memory or CPU than it is currently allotted

  • You can configure the amount of memory and cores each tool is allotted by following this documentation.

Problem: import/export of histories between UseGalaxy.no and NeLS Storage does not work

  • If you are using the old NGA-based operations: Check that both the NGA Master and 2 NGA Runner processes are running with: systemctl list-units --type service | grep nga. If some of these are not running, restart them with sudo systemctl restart nga-master, sudo systemctl restart nga-runner@1 and sudo systemctl restart nga-runner@2, respectively.

Problem: The whole Galaxy page is blank (but it seems to have been loaded)

  • There could be problems loading the static resources (JavaScript and CSS) that are needed to dynamically generate the page contents after it has been loaded. Check the network log in your browser to see if some files return HTTP errors.
  • One of the disk partitions on the server may be full (check with df -h).

Problem: Galaxy has crashed and does not start up correctly

  • Check the logs for problems with sudo journalctl -u galaxy. Add the -e flag to see the last messages (using -n <N> to control how many lines to see) or -f to follow while the log is being updated.
  • If the log complains about No usable temporary directory found in ['/tmp', '/var/tmp' ...], the disk may be full. Check with df -h and remove unnecessary files to free up space (especially log files in /var/log and other temporary files).
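
Since full disks are behind several of the problems in this guide, a quick filter on the df output can save time. The sketch below is an illustration only: the function name full_partitions and the 90 percent limit in the usage line are arbitrary choices, and it assumes the POSIX output format produced by df -P (usage in the fifth column, mount point in the sixth).

```shell
# Hypothetical helper: read `df -P` output on stdin and print every mount
# point whose usage is at or above the given percentage.
full_partitions() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}

# Intended use (not executed here):
#   df -P | full_partitions 90
```

Mount points containing spaces would be cut off at the first space, which should not matter for the partitions discussed here.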

Problem: The Galaxy server returns a "504 Gateway time-out" error

  • ... solution coming ...

Problem: The Galaxy Reports server just shows a blank page when you click on a Job ID number to show more information about a job

  • This will typically happen after upgrading Galaxy. A simple restart of the Reports server with sudo systemctl restart galaxy-reports should fix it.

Problem: An NREC compute node is unable to access the internet

  • There are several layers of firewalls in NREC. In addition to configuring the firewall on the compute node itself (which is done by our Ansible playbook), you must also add the node to a security group that allows internet access via the NREC administration interface.

Problem: Job handlers complain about ModuleNotFoundError: No module named 'usegalaxy'

  • The badly named "usegalaxy" module is something that we have made ourselves, and it contains our dynamic job dispatching rules that control where and how various jobs are executed (based on Galaxy Europe's Sorting Hat). It can be found in the directory "/srv/galaxy/dynamic_rules/usegalaxy". Make sure that the PYTHONPATH environment variable for the job handler processes includes the parent directory "/srv/galaxy/dynamic_rules".
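
Whether a PYTHONPATH value contains the required directory can be tested with a small helper. This is a sketch only: the function name path_contains is hypothetical, and reading the environment of a running handler from /proc/<PID>/environ is a general Linux facility, not something specific to our setup.

```shell
# Hypothetical helper: succeed if a colon-separated path list (first argument)
# contains exactly the given directory (second argument).
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Intended use (not executed here; <PID> is the process ID of a job handler):
#   handler_path=$(sudo cat /proc/<PID>/environ | tr '\0' '\n' | sed -n 's/^PYTHONPATH=//p')
#   path_contains "$handler_path" /srv/galaxy/dynamic_rules || echo "dynamic_rules is missing"
```

The comparison is literal, so an entry written with a trailing slash would not be recognized as a match.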

Problem: The compute backend status page suddenly reports zero job executions in the last week, even though I know that some jobs have been run

I also get an error message when running the command "list_jobs" on the main node (this is an alias for sacct -u galaxy -X --format=JobID,JobName%30,Start,End,Elapsed,AllocCPUS,ReqMem,State,NodeList%20).

  • This is usually due to a problem with the connection between the SlurmDB daemon running on the main node and the MySQL/MariaDB database that stores the information on the database node (db.[test.]usegalaxy.no). Look in the logs located in "/var/log/slurm/slurmdbd.log". Sometimes, the connection can be restored by running sudo mysqladmin flush-hosts on the DB-node followed by a restart of SlurmDBD on the main node with sudo systemctl restart slurmdbd.