seg fault in pio_syncfile #14

Closed
kshedstrom opened this issue Sep 24, 2015 · 8 comments

Comments

@kshedstrom

I have a case which runs on 4 cores with 4 pio tasks and blows up with 1 or 2 pio tasks. It's dying in the call to pio_syncfile. I get pages of this sort of output (see below), then the seg fault. The first active pio process is the one that complains from inside memcpy. Here's the stack trace:

 memcpy,                              FP=7fff8065b560
 ADIOI_NFS_WriteStrided,              FP=7fff8065b6f0
 ADIOI_GEN_WriteStridedColl,          FP=7fff8065bb00
 MPIOI_File_write_all,                FP=7fff8065bb70
 mca_io_romio_dist_MPI_File_write_at_all, FP=7fff8065bb90
 PMPI_File_write_at_all,              FP=7fff8065bbd0
 ncmpii_mgetput,                      FP=7fff8065bca0
 ncmpii_req_aggregation,              FP=7fff8065bdc0
 ncmpii_wait_getput,                  FP=7fff8065be30
 ncmpii_wait,                         FP=7fff8065bf30
 ncmpi_wait_all,                      FP=7fff8065bf70
 flush_output_buffer,                 FP=7fff8067bfb0
 PIOc_write_darray_multi,             FP=7fff8067c0b0
 flush_buffer,                        FP=7fff8067c0e0
 PIOc_sync,                           FP=7fff8067c130
 mod_pio`netcdf_sync,                 FP=7fff8067c170

GPTLstop: GPTLinitialize has not been called
GPTLstart name=PIO:write_darray_multi_nc: GPTLinitialize has not been called
GPTLstop: GPTLinitialize has not been called
GPTLstart name=PIO:flush_output_buffer: GPTLinitialize has not been called
/archive/u1/uaf/kate/src/parallelio/src/clib/pio_darray.c 1362 2
/archive/u1/uaf/kate/src/parallelio/src/clib/pio_darray.c 1366 2
GPTLstop: GPTLinitialize has not been called
GPTLstart name=PIO:write_darray_multi_nc: GPTLinitialize has not been called
GPTLstop: GPTLinitialize has not been called
GPTLstart name=PIO:flush_output_buffer: GPTLinitialize has not been called
GPTLstop: GPTLinitialize has not been called

GPTLstart name=PIO:rearrange_comp2io: GPTLinitialize has not been called

mpirun noticed that process rank 1 with PID 150416 on node pacman3 exited on signal 11 (Segmentation fault).
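
For reference, the path in the trace (PIOc_sync -> flush_buffer -> PIOc_write_darray_multi -> pnetcdf -> ROMIO) corresponds roughly to the minimal C-API sequence below. This is only a sketch, not the ROMS code: the file name, dimension sizes, and decomposition are made up, and the calls are the PIO2 C interface as I understand it, so the exact signatures may differ from the version in this repo.

    /* Not the ROMS code: a made-up minimal program exercising the same path.
       Error checking omitted for brevity; assumes ntasks divides gdim[0]. */
    #include <mpi.h>
    #include <pio.h>

    int main(int argc, char **argv)
    {
        int my_rank, ntasks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);      /* 4 compute tasks in my runs */

        /* The failing setup: fewer I/O tasks than compute tasks. */
        int num_iotasks = 1, stride = 1, base = 0, iosysid;
        PIOc_Init_Intracomm(MPI_COMM_WORLD, num_iotasks, stride, base,
                            PIO_REARR_BOX, &iosysid);

        /* Simple block decomposition of a 1-D global array. */
        int gdim[1] = {16};
        int elems_per_pe = gdim[0] / ntasks;
        PIO_Offset compmap[16];
        for (int i = 0; i < elems_per_pe; i++)
            compmap[i] = (PIO_Offset)(my_rank * elems_per_pe + i + 1);  /* 1-based; 0 = hole */
        int ioid;
        PIOc_InitDecomp(iosysid, PIO_DOUBLE, 1, gdim, elems_per_pe,
                        compmap, &ioid, NULL, NULL, NULL);

        /* Create a pnetcdf file with one variable on that dimension. */
        int iotype = PIO_IOTYPE_PNETCDF, ncid, dimid, varid;
        PIOc_createfile(iosysid, &ncid, &iotype, "sketch.nc", PIO_CLOBBER);
        PIOc_def_dim(ncid, "x", (PIO_Offset)gdim[0], &dimid);
        PIOc_def_var(ncid, "foo", PIO_DOUBLE, 1, &dimid, &varid);
        PIOc_enddef(ncid);

        /* Buffered distributed write, then the sync that triggers the flush. */
        double data[16];
        for (int i = 0; i < elems_per_pe; i++)
            data[i] = (double)(my_rank * elems_per_pe + i);
        PIOc_write_darray(ncid, varid, ioid, (PIO_Offset)elems_per_pe, data, NULL);
        PIOc_sync(ncid);                 /* pio_syncfile in Fortran ends up here */

        PIOc_closefile(ncid);
        PIOc_freedecomp(iosysid, ioid);
        PIOc_finalize(iosysid);
        MPI_Finalize();
        return 0;
    }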

Currently Loaded Modulefiles:

  1) ncl/6.1.0
  2) git/2.3.0
  3) nco/4.3.1.gnu-4.7.3
  4) gcc/4.7.3
  5) openmpi-gnu-4.7.3/1.4.3
  6) PrgEnv-gnu/4.7.3
  7) proj/4.9.1.gnu-4.7.3
  8) gdal/1.10.0
  9) python/2.7.4
 10) proj/4.8.0.gnu-4.7.3
 11) pyngl/1.4.0
 12) totalview/8.12.0-0
 13) ncview/2.1.2
 14) matlab/R2014a
 15) hdf5/1.8.10-p1.gnu-4.7.3
 16) wgrib2/1.9.6a
 17) jdk/1.8.0
 18) panoply/4.0.4

pnetcdf is 1.6.1

@Katetc
Contributor

Katetc commented Sep 24, 2015

Hi Kate,

We will be happy to look into this. Could you provide us with a case description for when you get this error? What kind of machine are you running on?
Thanks!

@kshedstrom
Author

Hi Kate,

I'm trying to reproduce this on Yellowstone. I've been running on a Penguin Computing Linux machine at ARSC.

Kate

@kshedstrom
Author

Kate,

This sounds stupid, I'm sure, but I've got an interactive session on caldera and I don't know how to run the model in parallel. If I try "poe -np 4 ./oceanM ocean_soliton.in", it asks for the name of the command to run and I give it "oceanM ocean_soliton.in", but the oceanM executable doesn't see the ocean_soliton.in argument (it reads a filename there).

Kate

@kshedstrom
Author

Hi Kate,

It runs on Yellowstone, no problem.

Kate

@jedwards4b
Contributor

To run on caldera in an interactive session, use:

    mpirun.lsf ./oceanM ...

Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO

@kshedstrom
Author

Hi Jim and Kate,

Since ifort was working for me on Yellowstone, I built everything with the Intel compilers on pacman. It's also dying in a call to pio_syncfile:

[pacman9:483725] *** An error occurred in MPI_Alltoallw
[pacman9:483725] *** reported by process [2399141889,2]
[pacman9:483725] *** on communicator MPI COMMUNICATOR 4 SPLIT FROM 0
[pacman9:483725] *** MPI_ERR_TYPE: invalid datatype
[pacman9:483725] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[pacman9:483725] *** and potentially your MPI job)
[pacman9:483721] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[pacman9:483721] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Kate
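
For what it's worth, MPI_ERR_TYPE from MPI_Alltoallw means one of the send or receive datatype entries was flagged as invalid (for example MPI_DATATYPE_NULL, or a type that was never committed). The sketch below is a generic illustration of that failure mode, not a diagnosis of the PIO rearranger; the buffers and names are made up. Some MPI implementations reject MPI_DATATYPE_NULL in a type slot even when that slot's count is zero, so a committed placeholder such as MPI_BYTE is needed for empty slots.

    /* Generic illustration only -- not PIO code.  An all-zero-count
       MPI_Alltoallw exchange; the datatype entries still have to be valid,
       committed types on implementations that check arguments.  Swapping
       MPI_BYTE below for MPI_DATATYPE_NULL is one way to provoke an
       "MPI_ERR_TYPE: invalid datatype" abort on such implementations. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Nothing to exchange: every count and displacement is zero. */
        int *counts = calloc(size, sizeof(int));
        int *displs = calloc(size, sizeof(int));
        MPI_Datatype *types = malloc(size * sizeof(MPI_Datatype));
        for (int i = 0; i < size; i++)
            types[i] = MPI_BYTE;          /* valid placeholder for empty slots */

        char sbuf[1], rbuf[1];
        MPI_Alltoallw(sbuf, counts, displs, types,
                      rbuf, counts, displs, types, MPI_COMM_WORLD);

        free(counts); free(displs); free(types);
        MPI_Finalize();
        return 0;
    }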

@Katetc
Contributor

Katetc commented Sep 28, 2015

Hi Kate,

We have a local linux cluster where I'd like to try to reproduce this problem, but I need a little more information. How are you configuring PIO to start? When you say you are going from 4 tasks to 2, how are you making that change? Any information you can give me about how best to try to reproduce this error would be very helpful. If you have code that you can tar up and send to me, that would be REALLY great, but I understand that can be difficult. My email is [email protected], if you want to directly contact me.

Thanks!
Kate

@Katetc
Contributor

Katetc commented Nov 9, 2015

Update: It looks like some of the issues were caused by an incorrect configuration:
    ! PIO number of I/O tasks
    NIOTASKS = 1

    PIO_STRIDE = 4
This caused a runtime seg fault, while the reverse (PIO_STRIDE = 1, NIOTASKS = 4) worked. The question remains whether we want to support a configuration like the one above, which could, in theory, be done. However, it is not a high priority at the moment, since this is not a common configuration. The issue will be moved to JIRA as a feature request.
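
For anyone reading along, a small sketch of how I understand the two settings interact (illustrative only, not taken from the PIO source; base rank 0 assumed): the I/O ranks are base, base + stride, ..., base + (NIOTASKS - 1) * stride, so NIOTASKS = 1 with PIO_STRIDE = 4 leaves rank 0 as the only I/O task, while NIOTASKS = 4 with PIO_STRIDE = 1 makes all four ranks I/O tasks.

    /* Illustrative only -- prints which MPI ranks become I/O tasks for a
       given (NIOTASKS, PIO_STRIDE) pair, assuming base rank 0 and the
       mapping rank = base + i * stride.  Not taken from the PIO source. */
    #include <stdio.h>

    static void show_io_ranks(int ntasks, int niotasks, int stride)
    {
        printf("%d tasks, NIOTASKS=%d, PIO_STRIDE=%d -> I/O ranks:",
               ntasks, niotasks, stride);
        for (int i = 0; i < niotasks && i * stride < ntasks; i++)
            printf(" %d", i * stride);
        printf("\n");
    }

    int main(void)
    {
        show_io_ranks(4, 1, 4);   /* the configuration that seg faulted */
        show_io_ranks(4, 4, 1);   /* the configuration that worked      */
        return 0;
    }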
