seg fault in pio_syncfile #14

Closed
kshedstrom opened this issue Sep 24, 2015 · 8 comments

Comments

@kshedstrom

I have a case which runs on 4 cores with 4 pio tasks and blows up with 1 or 2 pio tasks. It's dying in the call to pio_syncfile. I get pages of this sort of output (see below), then the seg fault. The first active pio process is the one that complains from inside memcpy. Here's the stack trace:

 memcpy,                              FP=7fff8065b560
 ADIOI_NFS_WriteStrided,              FP=7fff8065b6f0
 ADIOI_GEN_WriteStridedColl,          FP=7fff8065bb00
 MPIOI_File_write_all,                FP=7fff8065bb70
 mca_io_romio_dist_MPI_File_write_at_all, FP=7fff8065bb90
 PMPI_File_write_at_all,              FP=7fff8065bbd0
 ncmpii_mgetput,                      FP=7fff8065bca0
 ncmpii_req_aggregation,              FP=7fff8065bdc0
 ncmpii_wait_getput,                  FP=7fff8065be30
 ncmpii_wait,                         FP=7fff8065bf30
 ncmpi_wait_all,                      FP=7fff8065bf70
 flush_output_buffer,                 FP=7fff8067bfb0
 PIOc_write_darray_multi,             FP=7fff8067c0b0
 flush_buffer,                        FP=7fff8067c0e0
 PIOc_sync,                           FP=7fff8067c130
 mod_pio`netcdf_sync,                 FP=7fff8067c170

GPTLstop: GPTLinitialize has not been called
GPTLstart name=PIO:write_darray_multi_nc: GPTLinitialize has not been called
GPTLstop: GPTLinitialize has not been called
GPTLstart name=PIO:flush_output_buffer: GPTLinitialize has not been called
/archive/u1/uaf/kate/src/parallelio/src/clib/pio_darray.c 1362 2
/archive/u1/uaf/kate/src/parallelio/src/clib/pio_darray.c 1366 2
GPTLstop: GPTLinitialize has not been called
GPTLstart name=PIO:write_darray_multi_nc: GPTLinitialize has not been called
GPTLstop: GPTLinitialize has not been called
GPTLstart name=PIO:flush_output_buffer: GPTLinitialize has not been called
GPTLstop: GPTLinitialize has not been called

GPTLstart name=PIO:rearrange_comp2io: GPTLinitialize has not been called

mpirun noticed that process rank 1 with PID 150416 on node pacman3 exited on signal 11 (Segmentation fault).
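
For reference, the path in the trace (PIOc_sync -> flush_buffer -> PIOc_write_darray_multi -> pnetcdf -> ROMIO) corresponds roughly to the minimal C-API sequence below. This is only a sketch, not the ROMS code: the file name, dimension sizes, and decomposition are made up, and the calls are the PIO2 C interface as I understand it, so the exact signatures may differ from the version in this repo.

    /* Not the ROMS code: a made-up minimal program exercising the same path.
       Error checking omitted for brevity; assumes ntasks divides gdim[0]. */
    #include <mpi.h>
    #include <pio.h>

    int main(int argc, char **argv)
    {
        int my_rank, ntasks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);      /* 4 compute tasks in my runs */

        /* The failing setup: fewer I/O tasks than compute tasks. */
        int num_iotasks = 1, stride = 1, base = 0, iosysid;
        PIOc_Init_Intracomm(MPI_COMM_WORLD, num_iotasks, stride, base,
                            PIO_REARR_BOX, &iosysid);

        /* Simple block decomposition of a 1-D global array. */
        int gdim[1] = {16};
        int elems_per_pe = gdim[0] / ntasks;
        PIO_Offset compmap[16];
        for (int i = 0; i < elems_per_pe; i++)
            compmap[i] = (PIO_Offset)(my_rank * elems_per_pe + i + 1);  /* 1-based; 0 = hole */
        int ioid;
        PIOc_InitDecomp(iosysid, PIO_DOUBLE, 1, gdim, elems_per_pe,
                        compmap, &ioid, NULL, NULL, NULL);

        /* Create a pnetcdf file with one variable on that dimension. */
        int iotype = PIO_IOTYPE_PNETCDF, ncid, dimid, varid;
        PIOc_createfile(iosysid, &ncid, &iotype, "sketch.nc", PIO_CLOBBER);
        PIOc_def_dim(ncid, "x", (PIO_Offset)gdim[0], &dimid);
        PIOc_def_var(ncid, "foo", PIO_DOUBLE, 1, &dimid, &varid);
        PIOc_enddef(ncid);

        /* Buffered distributed write, then the sync that triggers the flush. */
        double data[16];
        for (int i = 0; i < elems_per_pe; i++)
            data[i] = (double)(my_rank * elems_per_pe + i);
        PIOc_write_darray(ncid, varid, ioid, (PIO_Offset)elems_per_pe, data, NULL);
        PIOc_sync(ncid);                 /* pio_syncfile in Fortran ends up here */

        PIOc_closefile(ncid);
        PIOc_freedecomp(iosysid, ioid);
        PIOc_finalize(iosysid);
        MPI_Finalize();
        return 0;
    }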

Currently Loaded Modulefiles:

  1) ncl/6.1.0
  2) git/2.3.0
  3) nco/4.3.1.gnu-4.7.3
  4) gcc/4.7.3
  5) openmpi-gnu-4.7.3/1.4.3
  6) PrgEnv-gnu/4.7.3
  7) proj/4.9.1.gnu-4.7.3
  8) gdal/1.10.0
  9) python/2.7.4
 10) proj/4.8.0.gnu-4.7.3
 11) pyngl/1.4.0
 12) totalview/8.12.0-0
 13) ncview/2.1.2
 14) matlab/R2014a
 15) hdf5/1.8.10-p1.gnu-4.7.3
 16) wgrib2/1.9.6a
 17) jdk/1.8.0
 18) panoply/4.0.4

pnetcdf is 1.6.1

@Katetc
Contributor

Katetc commented Sep 24, 2015

Hi Kate,

We will be happy to look into this. Could you provide us with a case description for when you get this error? What kind of machine are you running on?
Thanks!

@kshedstrom
Author

Hi Kate,

I'm trying to reproduce this on Yellowstone. I've been running on a Penguin Computing Linux machine at ARSC.

Kate

@kshedstrom
Author

Kate,

This sounds stupid, I'm sure, but I've got an interactive session on caldera and I don't know how to run the model in parallel. If I try "poe -np 4 ./oceanM ocean_soliton.in", it asks for the name of the command to run and I give it "oceanM ocean_soliton.in", but the oceanM executable doesn't see the ocean_soliton.in argument (it reads a filename there).

Kate

@kshedstrom
Author

Hi Kate,

It runs on Yellowstone, no problem.

Kate

@jedwards4b
Contributor

To run on caldera in an interactive session, use:

    mpirun.lsf ./oceanM ...

Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO

@kshedstrom
Author

Hi Jim and Kate,

Since ifort was working for me on Yellowstone, I built everything with the Intel compilers on pacman. It's also dying in a call to pio_syncfile:

[pacman9:483725] *** An error occurred in MPI_Alltoallw
[pacman9:483725] *** reported by process [2399141889,2]
[pacman9:483725] *** on communicator MPI COMMUNICATOR 4 SPLIT FROM 0
[pacman9:483725] *** MPI_ERR_TYPE: invalid datatype
[pacman9:483725] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[pacman9:483725] *** and potentially your MPI job)
[pacman9:483721] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[pacman9:483721] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Kate
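
For what it's worth, MPI_ERR_TYPE from MPI_Alltoallw means one of the send or receive datatype entries was flagged as invalid (for example MPI_DATATYPE_NULL, or a type that was never committed). The sketch below is a generic illustration of that failure mode, not a diagnosis of the PIO rearranger; the buffers and names are made up. Some MPI implementations reject MPI_DATATYPE_NULL in a type slot even when that slot's count is zero, so a committed placeholder such as MPI_BYTE is needed for empty slots.

    /* Generic illustration only -- not PIO code.  An all-zero-count
       MPI_Alltoallw exchange; the datatype entries still have to be valid,
       committed types on implementations that check arguments.  Swapping
       MPI_BYTE below for MPI_DATATYPE_NULL is one way to provoke an
       "MPI_ERR_TYPE: invalid datatype" abort on such implementations. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Nothing to exchange: every count and displacement is zero. */
        int *counts = calloc(size, sizeof(int));
        int *displs = calloc(size, sizeof(int));
        MPI_Datatype *types = malloc(size * sizeof(MPI_Datatype));
        for (int i = 0; i < size; i++)
            types[i] = MPI_BYTE;          /* valid placeholder for empty slots */

        char sbuf[1], rbuf[1];
        MPI_Alltoallw(sbuf, counts, displs, types,
                      rbuf, counts, displs, types, MPI_COMM_WORLD);

        free(counts); free(displs); free(types);
        MPI_Finalize();
        return 0;
    }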

@Katetc
Contributor

Katetc commented Sep 28, 2015

Hi Kate,

We have a local linux cluster where I'd like to try to reproduce this problem, but I need a little more information. How are you configuring PIO to start? When you say you are going from 4 tasks to 2, how are you making that change? Any information you can give me about how best to try to reproduce this error would be very helpful. If you have code that you can tar up and send to me, that would be REALLY great, but I understand that can be difficult. My email is [email protected], if you want to directly contact me.

Thanks!
Kate

@Katetc
Contributor

Katetc commented Nov 9, 2015

Update: It looks like some of the issues were caused by an incorrect configuration:
    ! PIO number of I/O tasks
    NIOTASKS = 1

    PIO_STRIDE = 4
This caused a runtime seg fault, while the reverse (PIO_STRIDE = 1, NIOTASKS = 4) worked. The question remains whether we want to support a configuration like the one above, which could, in theory, be done. However, it is not a high priority at the moment, since this is not a common configuration. The issue will be moved to JIRA as a feature request.
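
For anyone reading along, a small sketch of how I understand the two settings interact (illustrative only, not taken from the PIO source; base rank 0 assumed): the I/O ranks are base, base + stride, ..., base + (NIOTASKS - 1) * stride, so NIOTASKS = 1 with PIO_STRIDE = 4 leaves rank 0 as the only I/O task, while NIOTASKS = 4 with PIO_STRIDE = 1 makes all four ranks I/O tasks.

    /* Illustrative only -- prints which MPI ranks become I/O tasks for a
       given (NIOTASKS, PIO_STRIDE) pair, assuming base rank 0 and the
       mapping rank = base + i * stride.  Not taken from the PIO source. */
    #include <stdio.h>

    static void show_io_ranks(int ntasks, int niotasks, int stride)
    {
        printf("%d tasks, NIOTASKS=%d, PIO_STRIDE=%d -> I/O ranks:",
               ntasks, niotasks, stride);
        for (int i = 0; i < niotasks && i * stride < ntasks; i++)
            printf(" %d", i * stride);
        printf("\n");
    }

    int main(void)
    {
        show_io_ranks(4, 1, 4);   /* the configuration that seg faulted */
        show_io_ranks(4, 4, 1);   /* the configuration that worked      */
        return 0;
    }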
