Quadratic increasing costs for Engine::Put() during one step when using the SST Engine #1731

Closed
franzpoeschel opened this issue Sep 11, 2019 · 11 comments

@franzpoeschel
Contributor

While using ADIOS2 in PIConGPU for streaming IO, I noticed that during one step, each call to Engine::Put() took at least as long as the previous call. While investigating this, I found out the following:

  • The SST engine does not distinguish between sync and deferred mode, so each dataset is marshalled and copied into an internal buffer as soon as it is written.
    The total amount of incoming data is not known up front, so the buffer has to be reallocated as it grows.
  • Reallocation in the BP3 serializer (which SST uses) intentionally overrides the GNU STL's default power-of-2 growth and enforces linear growth instead.

This means that, within one step, the SST engine reallocates its buffer once for every chunk written after the first, and each reallocation copies everything written so far, so the total cost grows quadratically with the number of chunks. This is probably fine for BP3, where deferred workflows that avoid this behavior are encouraged, but it does not work for SST.

By uncommenting this line, I was able to avoid this issue for now.
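
To illustrate the cost difference, here is a minimal, self-contained sketch that counts how many bytes get copied by reallocations under the two growth policies. It is a model only and does not use ADIOS2: the function names and the fixed `increment` are hypothetical, and it assumes every reallocation copies the full current contents.

```cpp
#include <cassert>
#include <cstddef>

// Model only: bytes copied by reallocations while appending `chunks` chunks
// of `chunkSize` bytes when capacity grows by a fixed `increment` (linear growth).
std::size_t copiedBytesLinear(std::size_t chunks, std::size_t chunkSize,
                              std::size_t increment)
{
    assert(increment > 0);
    std::size_t capacity = 0, size = 0, copied = 0;
    for (std::size_t i = 0; i < chunks; ++i)
    {
        const std::size_t needed = size + chunkSize;
        if (needed > capacity)
        {
            while (capacity < needed)
            {
                capacity += increment; // grow by a constant amount
            }
            copied += size; // existing contents move to the new allocation
        }
        size = needed;
    }
    return copied;
}

// Same model, but capacity doubles whenever the buffer overflows.
std::size_t copiedBytesDoubling(std::size_t chunks, std::size_t chunkSize)
{
    std::size_t capacity = 0, size = 0, copied = 0;
    for (std::size_t i = 0; i < chunks; ++i)
    {
        const std::size_t needed = size + chunkSize;
        if (needed > capacity)
        {
            if (capacity == 0)
            {
                capacity = 1;
            }
            while (capacity < needed)
            {
                capacity *= 2; // geometric growth
            }
            copied += size;
        }
        size = needed;
    }
    return copied;
}
```

With the increment equal to the chunk size, the linear policy reallocates on every append, so the copied volume is `chunkSize * n * (n - 1) / 2` for `n` chunks, while the doubling policy stays proportional to the final buffer size.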

@williamfgc
Contributor

williamfgc commented Sep 11, 2019

@franzpoeschel your assumptions are correct, thanks for reporting this. As you guessed, we need to limit the power-of-2 growth: it is an implementation decision (not standardized) and can cause problems when large memory chunks are allocated. There are buffer settings you can try on the BP serializer, specifically InitialBufferSize and BufferGrowthFactor. Let us know if it helps. Thanks!
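
As a rough way to reason about those two settings, the sketch below counts how many reallocations a buffer would need to reach a given step size. It is a simplified model, not ADIOS2 code: `countReallocations` is a hypothetical helper, and it assumes the capacity starts at the initial size and is multiplied by the growth factor each time it overflows.

```cpp
#include <cassert>
#include <cstddef>

// Model only: number of reallocations needed to hold `totalBytes` when the
// buffer starts at `initialBytes` and its capacity is multiplied by
// `growthFactor` on every overflow.
std::size_t countReallocations(std::size_t totalBytes, std::size_t initialBytes,
                               double growthFactor)
{
    assert(initialBytes > 0);
    assert(growthFactor > 1.0);
    std::size_t reallocations = 0;
    // Track capacity as a double so small factors still make progress.
    double capacity = static_cast<double>(initialBytes);
    while (capacity < static_cast<double>(totalBytes))
    {
        capacity *= growthFactor;
        ++reallocations;
    }
    return reallocations;
}
```

In this model, an initial size at least as large as the step's data needs no reallocation at all, and a larger growth factor reduces the count when the initial size is too small.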

@franzpoeschel
Contributor Author

I'll check it out, thanks for the hint!

@franzpoeschel
Contributor Author

franzpoeschel commented Sep 13, 2019

I tested it out and it helped fix my issues. I think it would be helpful if the documentation mentioned that engines using the BP3 serializer also accept those parameters (or did I miss that?).
Also, the current solution boils down to tuning reallocation instead of avoiding it. In the long run, it might be desirable to support a deferred Put mode – is something like this planned?

@williamfgc
Contributor

@franzpoeschel point taken. @eisenhauer and @JasonRuonanWang maintain SST; can you take a look at the docs? Thanks.

@JasonRuonanWang
Member

@franzpoeschel Yes it is planned. We are working on replacing BP3 with BP4 in SST, and we will re-write many things during this process, including a true deferred mode on the writer side, and optimizations on the reader side as well.

@franzpoeschel
Contributor Author

This is good to know, thanks!
I haven't used ADIOS1 myself, but from what I have heard, in ADIOS1 it was required to announce all chunks to store before storing the first one (correct me if I'm wrong on that).
While this style is more tedious to write, it still has an advantage over a deferred Put mode for the issue above: in deferred mode, all data must remain allocated at the same time (until EndStep), whereas with all chunks announced beforehand, the SST engine can allocate a correctly sized buffer at the start and then accept incoming chunks one after another. I thought about handling this in memory-critical applications via the initial buffer size parameter, but that parameter is global for one engine and cannot be changed across steps (or can it?).
Long story short: Is it possible (or planned) for memory-critical applications to avoid reallocations without having to give ADIOS2 all data at once, as would become necessary in deferred mode?
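
A minimal sketch of the "announce first, then store" pattern described above, under the assumption that the chunk sizes are known before the first chunk is handed over. `AnnouncedStepBuffer` is a hypothetical class for illustration and is not part of ADIOS1 or ADIOS2.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical illustration: the step buffer is reserved once from the
// announced chunk sizes, then chunks are appended one at a time.
class AnnouncedStepBuffer
{
public:
    explicit AnnouncedStepBuffer(const std::vector<std::size_t> &announcedSizes)
    {
        const std::size_t total = std::accumulate(
            announcedSizes.begin(), announcedSizes.end(), std::size_t{0});
        m_Data.reserve(total); // single allocation for the whole step
        m_ReservedCapacity = m_Data.capacity();
    }

    // Copy one chunk in; the caller can release its own copy right after.
    void put(const char *chunk, std::size_t bytes)
    {
        assert(chunk != nullptr || bytes == 0);
        m_Data.insert(m_Data.end(), chunk, chunk + bytes);
    }

    std::size_t size() const { return m_Data.size(); }

    // True while appends have stayed within the initial reservation.
    bool neverReallocated() const
    {
        return m_Data.capacity() == m_ReservedCapacity;
    }

private:
    std::vector<char> m_Data;
    std::size_t m_ReservedCapacity = 0;
};
```

The point of this design is that the application only ever holds one chunk plus the step buffer, rather than keeping every chunk alive until the end of the step.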

@williamfgc
Contributor

@franzpoeschel your concerns are valid. The current issue is that the BP serializer inside SST is created and destroyed at every step, unlike in the file engines. Refactoring should take care of this problem, since the buffer allocated with InitialBufferSize would then persist across steps, avoiding frequent reallocations.
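
The idea of a buffer that outlives a single step can be sketched as follows. This is a hypothetical stand-in, not the SST or BP implementation: `PersistentStepBuffer` and its methods are invented for illustration, and it relies on `std::vector::clear()` keeping the existing allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: a buffer created once per engine and reused
// for every step, instead of being constructed and destroyed per step.
class PersistentStepBuffer
{
public:
    explicit PersistentStepBuffer(std::size_t initialBufferSize)
    {
        assert(initialBufferSize > 0);
        m_Data.reserve(initialBufferSize);
    }

    // Drop the previous step's contents but keep the allocation.
    void beginStep() { m_Data.clear(); }

    void put(const char *chunk, std::size_t bytes)
    {
        assert(chunk != nullptr || bytes == 0);
        if (m_Data.size() + bytes > m_Data.capacity())
        {
            ++m_Reallocations; // this append will force the vector to grow
        }
        m_Data.insert(m_Data.end(), chunk, chunk + bytes);
    }

    std::size_t reallocations() const { return m_Reallocations; }

private:
    std::vector<char> m_Data;
    std::size_t m_Reallocations = 0;
};
```

If the initial size covers the largest step, no step after construction triggers a reallocation in this model.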

@franzpoeschel
Contributor Author

So, the solution would be setting the buffer size once to the maximum needed size initially? If so, that's good to know. Thanks!

@pnorbert
Contributor

pnorbert commented Oct 7, 2019

Suggestion: a new function after BeginStep, for buffer allocation. @ax3l can then handle this situation in PIConGPU.

@franzpoeschel
Contributor Author

I think this is a good idea, since it would also cover engines that do not reuse buffers across steps.

@franzpoeschel
Contributor Author

Coming back to this, I think that serialization engines that persist across steps (such as in BP4) are the most elegant way to solve this, and that specifying InitialBufferSize is a sufficient way to give more hints to the backend. Since those methods let me avoid the problems described in the issue, I think I can close this.


4 participants