Common error messages
Below is a collection of error messages you might see.
This means that a process has received a SIGKILL (signal 9; the exit code is 128 + the signal number, i.e. 137).
This is often indicative of insufficient memory for a job. One way to fix this is to request more memory for the job; see Memory for more details.
This is an issue with reading or writing to disk. One common cause of such problems is memory mapping to the distributed file system: this is used by default with the JLD2.jl package, and can be disabled by specifying iotype=IOStream in jldopen/jldsave.
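For example, a minimal sketch of writing and reading without memory mapping (assuming a recent JLD2.jl; the file name and data are made up):
using JLD2

data = rand(10)  # example data

# Write through Julia's Base.IOStream backend instead of the default memory-mapped IO;
# per the note above, jldsave accepts the same iotype keyword.
jldopen("output.jld2", "w"; iotype=IOStream) do file
    file["data"] = data
end

# Reading works the same way
loaded = jldopen("output.jld2", "r"; iotype=IOStream) do file
    file["data"]
end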
This means a node is misconfigured, and is unable to execute the module command (which should be available on all nodes on the cluster). Open a ticket with IMSS, giving the hostname of the node.
A GPU is faulty. Open a ticket with IMSS, giving the hostname of the node.
Errors like
ERROR: The following 9 direct dependencies failed to precompile:
Combinatorics [861a8166-3701-5b0c-9a16-15d98fcdc6aa]
Failed to precompile Combinatorics [861a8166-3701-5b0c-9a16-15d98fcdc6aa] to /central/scratch/esm/slurm-buildkite/calibrateedmf-ci/depot/cpu/compiled/v1.7/Combinatorics/jl_Gv23Ay.
ERROR: LoadError: SystemError: opening file "/central/scratch/esm/slurm-buildkite/calibrateedmf-ci/depot/cpu/packages/Combinatorics/Udg6X/src/numbers.jl": No such file or directory
are due to upstream issues in Julia. The issue is a race condition that we experience when running multiple processes. For more information, please see the issue (#31953).
The race condition can arise in multiple ways, and it is more common when using the JULIA_DEPOT_PATH environment variable, which is specified in our .buildkite/pipeline.yml files.
We accept this flaky configuration because it speeds up initialization from roughly 10-15 minutes per build to about 1 minute per build. We could stop using the Julia depot path, but then our continuous integration tests, and the time until they start, would take 10-15 minutes longer. So using the Julia depot path has both upsides and downsides.
Sometimes this race condition can lead to a corrupted depot path, in which case we need to clear (delete) the depot path on Caltech Central in order to un-break CI. These issues are per-repo, and we have a few buildkite pipelines dedicated to making this easy (for those repos using the Julia depot path).
If you want to opt out of using the Julia depot path, simply delete the environment variable in the buildkite yaml file, which looks like this:
env:
  JULIA_DEPOT_PATH: "${BUILDKITE_BUILD_PATH}/${BUILDKITE_PIPELINE_SLUG}/depot/cpu"
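To check which depot a given job is actually using, a quick sketch like the following can be run from Julia inside the job (the logged values depend on your environment):
# The first entry of DEPOT_PATH is where packages are installed and precompiled;
# it reflects JULIA_DEPOT_PATH when that variable is set.
@info "Depot in use" first(DEPOT_PATH) get(ENV, "JULIA_DEPOT_PATH", "(not set)")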
Older versions of Julia (before 1.6) did not support importing packages with the syntax
import OrdinaryDiffEq as ODE
One way around this was to define a constant
import OrdinaryDiffEq
const ODE = OrdinaryDiffEq
However, these two ways of importing conflict with one another: trying them both in the same scope results in the error ERROR: LoadError: importing ODE into Main conflicts with an existing identifier.
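A minimal illustration of the conflict (a sketch; the ordering shown is one way to trigger it, and any package would do in place of OrdinaryDiffEq):
import OrdinaryDiffEq
const ODE = OrdinaryDiffEq       # old workaround: binds the name ODE in Main

import OrdinaryDiffEq as ODE     # ERROR: importing ODE into Main conflicts with an existing identifier

# Fix: pick one of the two forms and use it consistently within a given scope.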
TODO: document
perl: error: get_addr_info: getaddrinfo() failed: Name or service not known
perl: error: slurm_set_addr: Unable to resolve "head1"
perl: error: slurm_get_port: Address family '0' not supported
perl: error: Error connecting, bad data: family = 0, port = 0
perl: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:head1:6819: No such file or directory
perl: error: Sending PersistInit msg: No such file or directory
perl: error: get_addr_info: getaddrinfo() failed: Name or service not known
perl: error: slurm_set_addr: Unable to resolve "head1"
perl: error: Unable to establish control machine address
Use of uninitialized value in subroutine entry at /central/slurm/install/current/bin/seff line 57, <DATA> line 602.
perl: error: get_addr_info: getaddrinfo() failed: Name or service not known
perl: error: slurm_set_addr: Unable to resolve "head1"
perl: error: slurm_get_port: Address family '0' not supported
perl: error: Error connecting, bad data: family = 0, port = 0
perl: error: Sending PersistInit msg: No such file or directory
perl: error: DBD_GET_JOBS_COND failure: Unspecified error
This message can be indicative of a corrupted depot. To solve it, go to the dedicated clear-depot pipeline and start a New Build (you can use "clear depot" as the build message).
Caution: it is better to clear the depot when no other builds are running; otherwise, the depot can easily get corrupted again (and initializing a new depot typically takes ~15 min). Always warn team members before clearing the depot.
TODO: document
This is caused when attempting to use the Julia CUDA runtime artifact while also having a cuda cluster module loaded. You should either
- (preferred) configure CUDA to use the local CUDA runtime (see the sketch below), or
- not load a cuda module.
See https://github.com/JuliaGPU/CUDA.jl/issues/1755 for more details.
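For the preferred route, one possibility is a sketch like the following, assuming a recent CUDA.jl that provides set_runtime_version! with a local_toolkit keyword:
using CUDA

# Ask CUDA.jl to use the locally installed CUDA toolkit (e.g. from the cluster's
# cuda module) instead of downloading the CUDA runtime artifact. This records a
# preference in LocalPreferences.toml; restart Julia for it to take effect.
CUDA.set_runtime_version!(local_toolkit = true)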
Example:
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
Hostname: clima
cuIpcOpenMemHandle return value: 1
address: 0x7fd865200000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory. Try to reduce the device
memory footprint of your application.
--------------------------------------------------------------------------
[clima.gps.caltech.edu:1868819] Failed to register remote memory, rc=-1
There are two typical causes:
- CUDA-aware MPI is not compatible with the (default) pool allocator in CUDA.jl. The solution is to disable the CUDA.jl memory pool:
export JULIA_CUDA_MEMORY_POOL=none
- Slurm is restricting access to GPUs on different ranks. Disable Slurm GPU binding: launch srun with --gpu-bind=none, or set
export SLURM_GPU_BIND=none
Error:
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
If this error is occurring on the Caltech HPC when using NetCDF.jl, it is because NetCDF.jl is attempting to use the JLL-provided MPI instead of the system one.
To fix this, you need to include the following snippet. It will cause NetCDF to use the correct version of MPI.
using ClimaComms; ClimaComms.init(ClimaComms.context())
Update: The same error just came up when running a GPU job with srun, even with the ClimaComms line above. To avoid this situation (without running on MPI) I added the --mpi=none flag to the srun command.