null pointer in MinMax #1745

Closed
philip-davis opened this issue Sep 16, 2019 · 19 comments

@philip-davis
Collaborator

I'm doing a large streaming file run with gray-scott coupled to pdf_calc on Theta, and I am sometimes (not always) seeing pdf_calc crash with many of the following strangely-formatted errors:

terminate called after throwing an instance of 'terminate called after throwing an instance of 'terminate called after throwing an instance of 'std::invalid_argument'
std::invalid_argument'
std::invalid_argument'
  what():  ERROR: found null pointer in call to Variable<T>::MinMax

This is using BP4 with the following configuration (I have combined SST and BP4 parameters for reuse):

    <io name="SimulationOutput">
        <engine type="BP4">
            <!-- SST engine parameters -->
            <parameter key="RendezvousReaderCount" value="1"/>
            <parameter key="QueueLimit" value="1"/>
            <parameter key="QueueFullPolicy" value="Block"/>
            <!-- BP4/SST engine parameters -->
            <parameter key="OpenTimeoutSecs" value="900"/>
            <parameter key="BeginStepPollingFrequencySecs" value="1" />
            <parameter key="SubStreams" value="512"/>
        </engine>
    </io>

This occurs on the first timestep of the reader. The writer runs to completion. For reference, here is some of the pdf_calc code that precedes the MinMax call:
https://github.com/pnorbert/adiosvm/blob/75bf69b13638f7c67981f43d269d2a19e269da20/Tutorial/gray-scott/analysis/pdf_calc.cpp#L182-L216

@philip-davis
Collaborator Author

I don't see any errors before this one.

@philip-davis
Collaborator Author

This is on the /projects directory of Theta, so it is a Lustre file system.

@JasonRuonanWang
Member

@philip-davis Is this happening only with the BP4 engine, or with SST as well? If it's only BP4, then either @pnorbert or @lwan86 is probably the best person to ask.

@JasonRuonanWang
Member

But even if it's only using BP4 engine, I would still like to ask some further questions just for them to consider. Were you doing staging through files, or purely files? Could you help verify if it happens only in one of the two cases? Or does it happen in both cases?

@williamfgc
Contributor

@philip-davis thanks for reporting this. If it helps, the error states that the variable is null (false in our API). It'd be good practice to check the variable status after a call to InquireVariable; it would make the code safer. Why it is not valid is a matter of debugging and getting the stream variable for that step.

@philip-davis
Collaborator Author

@JasonRuonanWang Sorry to assign you incorrectly. Only BP4. I do not see this if I do post-processing, i.e. let the simulation run to completion, then run the analysis.

@williamfgc Unfortunately, I can't inspect the bp file with bpls (I don't get any output when I try, other than "Attempting to use an MPI routine before initializing MPICH"). I assume this means that my file is corrupted in some way, which might be related to the failure I'm experiencing. How does one check this, since it is the private member m_Variable that is null, rather than the API object itself?

@germasch
Contributor

@philip-davis If the bp4 output is not too large, do you mind sharing it so I can take a look at why bpls crashes?

@philip-davis
Collaborator Author

The data is too big (1.3TB). Would the metadata be useful? It's only about 15MB total.

@pnorbert
Contributor

pnorbert commented Sep 16, 2019 via email

@philip-davis
Collaborator Author

issue.tar.gz

Here is a tar of all the relevant configuration/metadata files. This is 4096 simulation processes on 128 nodes and 128 analysis processes on 32 nodes. The simulation domain is 2048^3.

<parameter key="OpenTimeoutSecs" value="900"/>
<parameter key="BeginStepPollingFrequencySecs" value="1" />

@philip-davis
Collaborator Author

512 substreams

@philip-davis
Collaborator Author

I have now seen the same thing on Cori with 256 writers and 8 readers.

@williamfgc
Contributor

> Unfortunately, I can't inspect the bp file with bpls (I don't get any output when I try, other than Attempting to use an MPI routine before initializing MPICH

Hi @philip-davis, please open a separate issue as this seems to be related to how the MPI changes affect bpls on Cori. We'll probably need a lot more info to understand and debug this.

> How does one check this, since it is the private member m_Variable that is null, rather than the API object itself?

Good question, the API object has the bool operator so it can be checked directly as if(variable) to see if it's a valid object. Thanks!
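
For readers following along, the handle pattern described above can be sketched in isolation. This is a minimal stand-in written for illustration, not ADIOS2 source: `CoreVariable`, `VariableHandle`, and `SafeMinMax` are hypothetical names, and the only assumptions taken from the thread are that the public object wraps a private `m_Variable` pointer, that it converts to `bool`, and that `MinMax` throws `std::invalid_argument` when that pointer is null.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the internal object behind a public handle.
struct CoreVariable
{
    double min;
    double max;
};

// Hypothetical handle: a thin public wrapper around a private pointer
// that may be null when the variable was not found in the current step.
class VariableHandle
{
public:
    VariableHandle() = default;
    explicit VariableHandle(CoreVariable *core) : m_Variable(core) {}

    // Lets callers write `if (variable)` without touching m_Variable.
    explicit operator bool() const noexcept { return m_Variable != nullptr; }

    // Throws on a null handle, analogous to the error in this issue.
    std::pair<double, double> MinMax() const
    {
        if (m_Variable == nullptr)
        {
            throw std::invalid_argument(
                "found null pointer in call to MinMax");
        }
        return {m_Variable->min, m_Variable->max};
    }

private:
    CoreVariable *m_Variable = nullptr;
};

// Guarded access: fills `out` and returns true only for a valid handle.
bool SafeMinMax(const VariableHandle &var, std::pair<double, double> &out)
{
    if (!var)
    {
        return false; // caller decides whether to skip the step or abort
    }
    out = var.MinMax();
    return true;
}
```

In analysis code the same guard would sit right after the inquire call, so that a missing variable produces a readable message or a skipped step instead of an uncaught exception on every rank.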

@pnorbert
Contributor

Just for a note: the BP file is fine, so the writer is okay. Some unexpected condition arises in the reader.

@pnorbert
Contributor

@philip-davis Can you please try #1772 on Theta or Cori? I wonder if this fix for not handling timeouts correctly already fixes this bug. I saw the same error on non-first steps on my VM and this PR fixes that. But my job on Theta is not going anywhere, so I cannot test whether it fixes your issue for first-step.

#1773 should fix this problem once and for all, but it is a complicated implementation to fix a problem that I imagined and may or may not be an actual problem. I am curious if #1772 is enough in itself to make your issues go away. Thanks.
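
As a generic illustration of the timeout handling mentioned above (this is not the change made in #1772 or #1773), a streaming reader can treat a "not ready" status as a reason to wait and retry, and only inquire variables once a step has actually been opened. The sketch below uses hypothetical names throughout: `StepStatus` here is a local enum, and `WaitForStep` is a helper invented for this example that takes any callable standing in for the engine's begin-step call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical status values for a step-based reader.
enum class StepStatus
{
    OK,
    NotReady,
    EndOfStream,
    OtherError
};

// Calls `beginStep` up to `maxAttempts` times, pausing between attempts
// while it reports NotReady. Returns the last status seen, so the caller
// can tell a timeout (NotReady) apart from OK or EndOfStream.
StepStatus WaitForStep(const std::function<StepStatus()> &beginStep,
                       int maxAttempts, std::chrono::milliseconds pause)
{
    StepStatus status = StepStatus::NotReady;
    for (int attempt = 0; attempt < maxAttempts; ++attempt)
    {
        status = beginStep();
        if (status != StepStatus::NotReady)
        {
            return status;
        }
        if (attempt + 1 < maxAttempts)
        {
            std::this_thread::sleep_for(pause);
        }
    }
    return status;
}
```

The point of returning the status rather than a boolean is that the caller should proceed to read variables only on `OK`; a `NotReady` result after the last attempt means the step never arrived and no variable handle should be used.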

@philip-davis
Collaborator Author

@pnorbert On Theta, I am seeing:

data[0] = 1 is out of [min,max] = [-1.79768e+308,2.84e-307]
 data[1] = 1 is out of [min,max] = [-1.79768e+308,2.84e-307]
 data[2] = 1 is out of [min,max] = [-1.79768e+308,2.84e-307]
 data[3] = 1 is out of [min,max] = [-1.79768e+308,2.84e-307]
 data[4] = 1 is out of [min,max] = [-1.79768e+308,2.84e-307]
...

This goes on for millions of lines. I see this with a non-streaming run as well. I am going to try Cori as well, but I am having trouble building this branch on Cori for some reason.

@williamfgc
Contributor

@pnorbert @philip-davis please try on your end if the current release branch solves the issue. Thanks!

@pnorbert
Contributor

This issue was caused by multiple bugs. After #1773 the gray-scott example works fine in BP4 streaming mode. I tested it on Cori. My job on Theta has been in the queue for a long time.
Please confirm that it works for you.

@philip-davis
Collaborator Author

Yes, this is working with #1773.
