Common error messages
Below is a collection of error messages you might see.
This means that a process has received a SIGKILL (signal 9; the exit code is 128 + the signal number, i.e. 137).
This is often indicative of insufficient memory for a job. One way to fix this is to request more memory for the job; see Memory for more details.
This is an issue with reading or writing to disk. One common cause of such problems is memory mapping to the distributed file system: this is used by default with the JLD2.jl package, and can be disabled by specifying iotype=IOStream in jldopen/jldsave.
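For example, a minimal sketch of writing and reading without memory mapping (assuming a recent JLD2.jl; the file name and data are made up):
using JLD2

data = rand(10)  # example data

# Write through Julia's Base.IOStream backend instead of the default memory-mapped IO;
# per the note above, jldsave accepts the same iotype keyword.
jldopen("output.jld2", "w"; iotype=IOStream) do file
    file["data"] = data
end

# Reading works the same way
loaded = jldopen("output.jld2", "r"; iotype=IOStream) do file
    file["data"]
end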
This means a node is misconfigured, and is unable to execute the module command (which should be available on all nodes on the cluster). Open a ticket with IMSS, giving the hostname of the node.
A GPU is faulty. Open a ticket with IMSS, giving the hostname of the node.
Errors like
ERROR: The following 9 direct dependencies failed to precompile:
Combinatorics [861a8166-3701-5b0c-9a16-15d98fcdc6aa]
Failed to precompile Combinatorics [861a8166-3701-5b0c-9a16-15d98fcdc6aa] to /central/scratch/esm/slurm-buildkite/calibrateedmf-ci/depot/cpu/compiled/v1.7/Combinatorics/jl_Gv23Ay.
ERROR: LoadError: SystemError: opening file "/central/scratch/esm/slurm-buildkite/calibrateedmf-ci/depot/cpu/packages/Combinatorics/Udg6X/src/numbers.jl": No such file or directory
are due to upstream issues in Julia. The issue is a race condition that we experience when running multiple processes. For more information, please see the issue (#31953).
The race condition can arise in multiple ways, and it is more common when using the JULIA_DEPOT_PATH environment variable, which is specified in our .buildkite/pipeline.yml files.
We accept this flaky configuration because it speeds up initialization from roughly 10-15 minutes per build to about 1 minute per build. We could stop using the Julia depot path, but then our continuous integration tests, and the time until they start, would take 10-15 minutes longer. So using the Julia depot path has both upsides and downsides.
Sometimes this race condition can lead to a corrupted depot path, in which case we need to clear (delete) the depot path on Caltech Central in order to un-break CI. These issues are per-repo, and we have a few buildkite pipelines dedicated to making this easy (for those repos using the Julia depot path).
If you want to opt out of using the Julia depot path, simply delete the environment variable in the buildkite yaml file, which looks like this:
env:
  JULIA_DEPOT_PATH: "${BUILDKITE_BUILD_PATH}/${BUILDKITE_PIPELINE_SLUG}/depot/cpu"
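To check which depot a given job is actually using, a quick sketch like the following can be run from Julia inside the job (the logged values depend on your environment):
# The first entry of DEPOT_PATH is where packages are installed and precompiled;
# it reflects JULIA_DEPOT_PATH when that variable is set.
@info "Depot in use" first(DEPOT_PATH) get(ENV, "JULIA_DEPOT_PATH", "(not set)")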
Older versions of Julia (before 1.6) did not support importing packages with the syntax
import OrdinaryDiffEq as ODE
One way around this was to define a constant
import OrdinaryDiffEq
const ODE = OrdinaryDiffEq
However, these two ways of importing conflict with one another: trying them both in the same scope results in the error ERROR: LoadError: importing ODE into Main conflicts with an existing identifier.
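A minimal illustration of the conflict (a sketch; the ordering shown is one way to trigger it, and any package would do in place of OrdinaryDiffEq):
import OrdinaryDiffEq
const ODE = OrdinaryDiffEq       # old workaround: binds the name ODE in Main

import OrdinaryDiffEq as ODE     # ERROR: importing ODE into Main conflicts with an existing identifier

# Fix: pick one of the two forms and use it consistently within a given scope.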
TODO: document
perl: error: get_addr_info: getaddrinfo() failed: Name or service not known
perl: error: slurm_set_addr: Unable to resolve "head1"
perl: error: slurm_get_port: Address family '0' not supported
perl: error: Error connecting, bad data: family = 0, port = 0
perl: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:head1:6819: No such file or directory
perl: error: Sending PersistInit msg: No such file or directory
perl: error: get_addr_info: getaddrinfo() failed: Name or service not known
perl: error: slurm_set_addr: Unable to resolve "head1"
perl: error: Unable to establish control machine address
Use of uninitialized value in subroutine entry at /central/slurm/install/current/bin/seff line 57, <DATA> line 602.
perl: error: get_addr_info: getaddrinfo() failed: Name or service not known
perl: error: slurm_set_addr: Unable to resolve "head1"
perl: error: slurm_get_port: Address family '0' not supported
perl: error: Error connecting, bad data: family = 0, port = 0
perl: error: Sending PersistInit msg: No such file or directory
perl: error: DBD_GET_JOBS_COND failure: Unspecified error
This message can be indicative of a corrupted depot. To solve it, go to the dedicated clear-depot pipeline and start a New Build (you can use "clear depot" as the build message).
Caution: it is better to clear the depot when no other builds are running; otherwise, the depot can easily get corrupted again (and initializing a new depot typically takes ~15 min). Always warn team members before clearing the depot.
TODO: document
This is caused when attempting to use the Julia CUDA runtime artifact while also having a cuda cluster module loaded. You should either
- (preferred) configure CUDA to use the local CUDA runtime (see the sketch below), or
- not load a cuda module.
See https://github.com/JuliaGPU/CUDA.jl/issues/1755 for more details.
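For the preferred route, one possibility is a sketch like the following, assuming a recent CUDA.jl that provides set_runtime_version! with a local_toolkit keyword:
using CUDA

# Ask CUDA.jl to use the locally installed CUDA toolkit (e.g. from the cluster's
# cuda module) instead of downloading the CUDA runtime artifact. This records a
# preference in LocalPreferences.toml; restart Julia for it to take effect.
CUDA.set_runtime_version!(local_toolkit = true)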
Example:
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
Hostname: clima
cuIpcOpenMemHandle return value: 1
address: 0x7fd865200000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory. Try to reduce the device
memory footprint of your application.
--------------------------------------------------------------------------
[clima.gps.caltech.edu:1868819] Failed to register remote memory, rc=-1
There are two typical causes:
- CUDA-aware MPI is not compatible with the (default) pool allocator in CUDA.jl. The solution is to disable the CUDA.jl memory pool:
export JULIA_CUDA_MEMORY_POOL=none
- Slurm is restricting access to GPUs on different ranks. Disable Slurm GPU binding: launch srun with --gpu-bind=none, or set
export SLURM_GPU_BIND=none
Error:
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
If this error is occurring on the Caltech HPC when using NetCDF.jl, it is because NetCDF.jl is attempting to use the JLL-provided MPI instead of the system one.
To fix this, you need to include the following snippet. It will cause NetCDF to use the correct version of MPI.
using ClimaComms; ClimaComms.init(ClimaComms.context())
Update: The same error just came up when running a GPU job with srun, even with the ClimaComms line above. To avoid this situation (without running on MPI) I added the --mpi=none flag to the srun command.