
Large memory consumption in BufferSTL when using PerformPuts and Flush instead of Begin/EndStep (BP4) #1891

Closed
franzpoeschel opened this issue Nov 27, 2019 · 5 comments


@franzpoeschel (Contributor)

During performance evaluations of the ADIOS2 backend in the openPMD API, I noticed unusually large heap memory consumption in non-streaming workflows (in this case with the BP4 engine) and traced it back to memory not being freed from the marshalling buffer (BufferSTL).

Since openPMD's iterations cannot be easily modeled using the ADIOS2 step concept, this backend only uses steps for streaming engines. For disk-based engines, we use Engine::PerformPuts and Engine::Flush instead. The documentation for the latter method says:

> Manually flush to underlying transport to guarantee data is moved

This suggests that data should no longer be held by ADIOS after this call(?). The figure below shows the memory trace from a small example, writing 30 openPMD iterations from PIConGPU to disk with several flushes per iteration. This memory buildup is not visible when using ADIOS steps.

[Screenshot: heap memory trace of the BP4 run, 2019-11-27]

Did I understand the functionality of Engine::Flush correctly? If so, calling it should free the buffer. If not, is there a suggested alternative that avoids ADIOS steps without building up heap memory usage in the described way?
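
For reference, the write pattern in question looks roughly like this (a minimal sketch; engine, file, and variable names are placeholders, the real code lives in the openPMD API):

#include <adios2.h>
#include <vector>

int main()
{
    adios2::ADIOS adios;
    adios2::IO IO = adios.DeclareIO( "IO" );
    IO.SetEngine( "bp4" );
    adios2::Engine engine = IO.Open( "diskbased.bp", adios2::Mode::Write );

    std::vector< double > data( 1024, 1. );
    auto variable =
        IO.DefineVariable< double >( "data", { 1024 }, { 0 }, { 1024 } );

    engine.Put( variable, data.data() ); // deferred Put
    engine.PerformPuts();                // marshal into the ADIOS buffer
    engine.Flush();                      // move buffered data to the transport
    engine.Close();
}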

@williamfgc (Contributor)

@franzpoeschel first of all, thanks for providing numbers, as that always enriches the conversation. The memory growth you're seeing is expected within a single adios2 step, and as you report, you are only using one adios2 step.
A few things:

  1. Are you using Put in sync mode? Usually memory grows as shown in your graph.

  2. Flush calls the underlying transports (e.g. POSIX write, fstream write+flush), so it guarantees your data is moved, as the docs say. It shouldn't change the memory allocation per se, but it should reset the buffer position to 0 so the next Put doesn't grow the buffer; if you're not seeing this, then it might be a bug.

  3. Can you try the BP3 engine? BP4 is designed for use cases with several steps and large metadata (variables, ranks, etc.). You probably won't see any difference in the allocation, but it would help us understand with real numbers.

  4. Out of curiosity, what's the reason openPMD steps can't be mapped to adios2 steps? A step in adios2 is just an I/O step, a batch of variables; it could be anything (timestep, iteration number, acquired data, etc.).

> If not, is there a suggested alternative that avoids ADIOS steps without building up heap memory usage in the described way?

adios2 Put can do two things in terms of memory management:

  1. Buffer your data by copying your memory to an internal buffer (sync reallocates per variable, deferred allocates at once).
  2. Provide direct buffer access (zero-copy) with a span signature; that way you can use the adios2 buffer for computation or to extract contiguous memory from non-contiguous memory structures (e.g. table subsets). A sketch of both follows below.
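
A minimal sketch of these options (engine, file, and variable names are placeholders):

#include <adios2.h>
#include <numeric>
#include <vector>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("IO");
    io.SetEngine("bp4");
    adios2::Engine engine = io.Open("put_modes.bp", adios2::Mode::Write);

    std::vector<double> data(100, 1.0);
    auto varSync = io.DefineVariable<double>("sync", {100}, {0}, {100});
    auto varDeferred = io.DefineVariable<double>("deferred", {100}, {0}, {100});
    auto varSpan = io.DefineVariable<double>("span", {100}, {0}, {100});

    engine.BeginStep();

    // 1a. sync: copied into the internal buffer before Put returns
    engine.Put(varSync, data.data(), adios2::Mode::Sync);

    // 1b. deferred (default): the pointer must stay valid until
    //     PerformPuts/EndStep, where all deferred puts are collected at once
    engine.Put(varDeferred, data.data());

    // 2. span (zero-copy): a view into the internal ADIOS buffer,
    //    filled in place before the step is closed
    adios2::Variable<double>::Span span = engine.Put(varSpan);
    std::iota(span.begin(), span.end(), 0.0);

    engine.EndStep();
    engine.Close();
}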

Let me know if you have any questions. Thanks!

@franzpoeschel (Contributor, Author)

> 1. Are you using Put in sync mode? Usually memory grows as shown in your graph.

I use deferred mode, with intermittent calls to PerformPuts and Flush to write data to disk at appropriate locations.

> 2. Flush calls the underlying transports (e.g. POSIX write, fstream write+flush), so it guarantees your data is moved, as the docs say. It shouldn't change the memory allocation per se, but it should reset the buffer position to 0 so the next Put doesn't grow the buffer; if you're not seeing this, then it might be a bug.

My current guess is that this is the case (i.e. a bug). I regularly call Engine::Flush, and the amount of data written between two such calls does not grow in the way that the memory usage profile shows.
I don't see this growth when using the SST engine (where the ADIOS2 backend in openPMD uses steps to define streaming packets). I am currently running a test case with the same parameterization using the SST engine and expect results within the next hour; I will upload them here for comparison.

> 3. Can you try the BP3 engine? BP4 is designed for use cases with several steps and large metadata (variables, ranks, etc.). You probably won't see any difference in the allocation, but it would help us understand with real numbers.

I experience the same issue in BP3.
[Screenshot: heap memory trace of the BP3 run, 2019-11-27]

> 4. Out of curiosity, what's the reason openPMD steps can't be mapped to adios2 steps? A step in adios2 is just an I/O step, a batch of variables; it could be anything (timestep, iteration number, acquired data, etc.).

The openPMD standard defines an iteration as a self-contained group in the openPMD hierarchy. As a result, the "conceptually same" dataset in two different iterations will be modeled as two physically completely independent datasets, e.g.

  int32_t   /data/8/particles/e/positionOffset/z         {163840}
  int32_t   /data/7/particles/e/positionOffset/z         {163840}              

(The two datasets happen to have the same dimensions in this example, but that is not required.)
The approach suggested by ADIOS steps would be to define both as one dataset that comes in different "versions". Using ADIOS steps to model openPMD iterations would therefore require very pervasive refactoring in the openPMD API, and it might not even be possible to fully implement the openPMD standard that way.
As you say, a step in ADIOS2 can be anything, but using steps for openPMD iterations would give them a very specific, less generic meaning in the openPMD API, coupling the semantic interpretation (openPMD iterations) to the physical data description (packets in streaming engines). This causes problems, for example, when a simulation produces more data per iteration than fits in host memory. By using ADIOS steps only to describe the physical data layout, we are free to send one iteration in several packets, or several iterations in one packet.
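
To illustrate the difference, a rough sketch of the two modeling approaches (data sizes follow the example above; file and other variable names are placeholders):

#include <adios2.h>
#include <cstdint>
#include <string>
#include <vector>

int main()
{
    adios2::ADIOS adios;
    std::vector< int32_t > data( 163840, 0 );

    // openPMD modeling: every iteration is a self-contained group, so the
    // "conceptually same" dataset becomes an independent variable per iteration
    adios2::IO perIteration = adios.DeclareIO( "perIteration" );
    adios2::Engine writer =
        perIteration.Open( "per_iteration.bp", adios2::Mode::Write );
    for( unsigned iteration = 7; iteration <= 8; ++iteration )
    {
        auto variable = perIteration.DefineVariable< int32_t >(
            "/data/" + std::to_string( iteration ) +
                "/particles/e/positionOffset/z",
            { 163840 },
            { 0 },
            { 163840 } );
        writer.Put( variable, data.data() );
        writer.PerformPuts();
    }
    writer.Close();

    // ADIOS-steps modeling: one variable that appears in a new "version"
    // at every step
    adios2::IO perStep = adios.DeclareIO( "perStep" );
    adios2::Engine streamWriter =
        perStep.Open( "per_step.bp", adios2::Mode::Write );
    auto variable = perStep.DefineVariable< int32_t >(
        "positionOffset_z", { 163840 }, { 0 }, { 163840 } );
    for( unsigned iteration = 7; iteration <= 8; ++iteration )
    {
        streamWriter.BeginStep();
        streamWriter.Put( variable, data.data() );
        streamWriter.EndStep();
    }
    streamWriter.Close();
}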

Thank you for your fast reply!

@franzpoeschel (Contributor, Author) commented Nov 27, 2019

It seems I made an error yesterday when investigating the behavior of the SST engine. In my latest test run, which used the SST engine (where I definitely use ADIOS steps), I see the same issue, but with slightly lower numbers.

[Screenshot: heap memory trace of the SST run, 2019-11-27]

I will further investigate this, but do you have an idea what could be the issue?

EDIT: I think this measurement was simply due to having no queue limit and a slow reader. I will run the test again.

@williamfgc (Contributor)

@franzpoeschel thanks for the detailed explanation, it's very helpful for our understanding. SST uses the BP3 buffering strategy per step, so no surprises in your results. The only caveat is that the memory is reallocated at every step, which is discussed in #1731.

> I use deferred mode, with intermittent calls to PerformPuts and Flush to write data to disk at appropriate locations.

Makes sense. Sync mode is basically deferred + PerformPuts for each variable; you just batch a few deferred Put calls before each PerformPuts.

> My current guess is that this is the case (i.e. a bug). I regularly call Engine::Flush, and the amount of data written between two such calls does not grow in the way that the memory usage profile shows.

Is this helping with your problem? The trade-off is that you'd be calling the filesystem more often, but it seems that you have to, since you're memory-bound. Another option in BP-based engines is to control the buffer growth strategy; check InitialBufferSize in the docs. The only issue is that you'd have to come up with some heuristic to set an appropriate InitialBufferSize. There are a few more options you can control, but allocating at once might be the most helpful.
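
For example (the concrete size is a placeholder; you'd derive it from your own heuristics):

#include <adios2.h>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("IO");
    io.SetEngine("bp4");
    // pre-allocate the internal buffer once so it doesn't have to grow;
    // the value is a placeholder for the expected data volume between flushes
    io.SetParameter("InitialBufferSize", "2Gb");
    adios2::Engine engine = io.Open("preallocated.bp", adios2::Mode::Write);
    // ... Put/PerformPuts/Flush as before ...
    engine.Close();
}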

> As a result, the "conceptually same" dataset in two different iterations will be modeled as two physically completely independent datasets, e.g.

Not sure if it helps the openPMD model, but the only things that must remain constant across adios2 steps for a variable are the name, type, ShapeID, and number of dimensions. Basically, those are the variable invariants in adios2; as you know, attributes are only loosely coupled through the IO factory. Everything else can change, not only per step but per block (a single call to Put): dimension values can change, variables might not exist at certain steps, the number of blocks per step can vary, etc.
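
A small sketch of what stays fixed and what may vary (names and sizes are placeholders):

#include <adios2.h>
#include <vector>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("IO");
    adios2::Engine engine = io.Open("varying_shapes.bp", adios2::Mode::Write);

    // name, type, ShapeID and number of dimensions are fixed at definition...
    auto var = io.DefineVariable<double>("v", {100}, {0}, {100});

    for (size_t step = 1; step <= 3; ++step)
    {
        // ...but the dimension values may differ at every step (or block)
        const size_t n = 100 * step;
        std::vector<double> data(n, static_cast<double>(step));

        engine.BeginStep();
        var.SetShape({n});
        var.SetSelection({{0}, {n}});
        engine.Put(var, data.data());
        engine.EndStep();
    }
    engine.Close();
}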

Hope this helps. Let me know if you have more questions. Basically, adios2 provides a few options to control your I/O workflow, but it's subject to the physical limits of your data, metadata, and system resources (memory, network, disks, etc.).

@franzpoeschel (Contributor, Author)

I think I should come back and report on this one.
We no longer use Engine::Flush(), since we have moved to step-based data consumption for file-based workflows as well. With PRs #2516 and #2534 merged and issue #2532 closed, we now observe constant heap memory usage when using steps. (Setting InitialBufferSize remains necessary for BP3 and SST, and is strongly recommended for BP4.)
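
For reference, the step-based pattern we moved to looks roughly like this (a minimal sketch; names and sizes are placeholders):

#include <adios2.h>
#include <numeric>
#include <vector>

int main()
{
    constexpr size_t length = 10000;

    adios2::ADIOS adios;
    adios2::IO IO = adios.DeclareIO( "IO" );
    IO.SetEngine( "bp4" );
    // still recommended, see above; the size is a placeholder
    IO.SetParameter( "InitialBufferSize", "1Gb" );
    adios2::Engine engine = IO.Open( "with_steps.bp", adios2::Mode::Write );

    using datatype = double;
    std::vector< datatype > streamData( length );
    std::iota( streamData.begin(), streamData.end(), 0. );

    auto variable = IO.DefineVariable< datatype >(
        "var", { length }, { 0 }, { length }, /* constantDims = */ true );

    for( unsigned step = 0; step < 1000; ++step )
    {
        engine.BeginStep();
        engine.Put( variable, streamData.data() );
        engine.EndStep(); // drains the buffer, heap usage stays constant
    }
    engine.Close();
}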

I used a little example program to check the behavior of Engine::Flush() nowadays with ADIOS 2.7.0:

#include <adios2.h>
#include <numeric>
#include <vector>

int main( int argsc, char ** argsv )
{
    constexpr size_t length = 10000;
    std::string engine_type = "bp4";
    if( argsc > 1 )
    {
        engine_type = argsv[ 1 ];
    }

    adios2::ADIOS adios;
    adios2::IO IO = adios.DeclareIO( "IO" );
    IO.SetEngine( engine_type );
    adios2::Engine engine = IO.Open( "no_steps.bp", adios2::Mode::Write );

    using datatype = double;
    std::vector< datatype > streamData( length );
    std::iota( streamData.begin(), streamData.end(), 0. );

    for( unsigned step = 0; step < 1000; ++step )
    {
        auto variable = IO.DefineVariable< datatype >(
            "var" + std::to_string( step ),
            { length },
            { 0 },
            { length },
            /* constantDims = */ true );

        engine.Put( variable, streamData.data() );
        // move to ADIOS buffer
        engine.PerformPuts();
        // move to file
        engine.Flush();
    }
    engine.Close();
}

Memory consumption peaks at 1.4MB:
[Screenshot: memory profile of the example with Engine::Flush(), 2021-02-22]
The resulting file contains only the last variable:

> bpls no_steps.bp/
  double   var999  {10000}

I don't know whether this is intended behavior, but since it is no longer relevant for our usage, I think I can close the issue anyway. I haven't tested any streaming engines, since openPMD makes the use of steps mandatory for streaming.

Without calling Engine::Flush(), memory consumption will increase continually:
[Screenshot: memory profile of the example without Engine::Flush(), 2021-02-22]

> Not sure if it helps the openPMD model, but the only things that must remain constant across adios2 steps for a variable are the name, type, ShapeID, and number of dimensions. Basically, those are the variable invariants in adios2; as you know, attributes are only loosely coupled through the IO factory. Everything else can change, not only per step but per block (a single call to Put): dimension values can change, variables might not exist at certain steps, the number of blocks per step can vary, etc.

An iteration schema that makes better use of ADIOS steps is currently WIP.
