Expose internal buffers to writers #901

Merged · 20 commits · Apr 21, 2021

Conversation

@franzpoeschel (Contributor) commented on Jan 14, 2021

Based on #904 (see below for the reason), for comparison: franzpoeschel/openPMD-api@topic-pipe-executable...franzpoeschel:topic-span

This makes the functionality provided by adios2::Variable::Span<T> available to users of the openPMD API.

Use case: if the writing application allocates write buffers only right before calling RecordComponent::storeChunk() (as opposed to passing buffers to storeChunk() that already exist in the application), this PR allows the buffer to be allocated by the openPMD backend instead (e.g. ADIOS2 can provide a view into its serialization buffers). In such a case, this may avoid memory copies, ideally cutting memory usage in half.
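
For illustration, a span-based write looks roughly like this. This is a minimal sketch modeled on examples/12_span_write.cpp added in this PR; the dataset name and extent are made up, and method names follow that example:

```cpp
#include <openPMD/openPMD.hpp>

#include <cstddef>

using namespace openPMD;

int main()
{
    Series series( "spanWrite.bp", Access::CREATE );
    auto rc = series.iterations[ 0 ].meshes[ "E" ][ "x" ];
    Extent extent{ 1000 };
    rc.resetDataset( { Datatype::DOUBLE, extent } );

    // Instead of allocating our own buffer and passing it to storeChunk(),
    // let the backend allocate it (in ADIOS2: a view into the serialization
    // buffer) and write directly into that memory, avoiding one memcopy.
    auto dynamicView = rc.storeChunk< double >( Offset{ 0 }, extent );
    auto span = dynamicView.currentBuffer();
    for( std::size_t i = 0; i < span.size(); ++i )
    {
        span[ i ] = double( i );
    }

    series.flush();
}
```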

Applications that may benefit:

  • PIConGPU: Before storing record components, the openPMD plugin in PIConGPU converts data from PIConGPU's internal AoS representation into the SoA representation used by openPMD. Using this PR, data can be converted directly into backend buffers.
    UPDATE: I've implemented this in a PIConGPU branch; this diff gives an example of usage.
  • openPMD-pipe: Data can be directly loaded from the reading backend into the writing backend, avoiding the detour through user-space buffers.

TODO:

  • Parallel testing, more testing. Note: our use of this in openpmd-pipe already tests this exhaustively in parallel.
  • Generic fallback implementation for backends that do not support this mode
  • ADIOS2 implementation
  • Failing test: I can reproduce this with ADIOS 2.6.0; ADIOS 2.7.0 does not have this issue. Hence, this needs to wait until we bump the required ADIOS2 version. (Update: nah, that one was on me actually.)
  • There seems to be some failure in ADIOS2 for larger datasets (running openpmd-pipe.py on the HDF5 git sample crashes)
  • Add IOTask to ADIOS1 queues
  • Python bindings
  • Documentation, cleanup
  • Fix handling of scalar record components
  • Fix intermittent resizing of serialization buffer in ADIOS2
  • Make old spans reusable?
  • Merge "Add openpmd-pipe.py command line tool" (#904) first
  • This one reshuffles parts of our flushing logic. Properly document and check that things are fine.
  • Allow deferred calls to avoid buffer reallocations in ADIOS2.
  • I've rebased this onto "Particle Patches: Do not emplace patch records if they don't exist in the file being read" (#945) to ease benchmarking with openpmd-pipe. So, merge that first.
  • Avoid buffer reallocations in ADIOS – impossible without explicit support from ADIOS
  • Should we update the Span interface in C++ to be more similar to the one in Python? Maybe silent buffer reallocation is a bit too much to hide automatically behind a Span interface?

@franzpoeschel force-pushed the topic-span branch 3 times, most recently from 5d6a2f1 to a5b5413 on January 15, 2021
@franzpoeschel changed the title from "Expose internal buffers for writers" to "Expose internal buffers to writers" on Jan 18, 2021
@franzpoeschel force-pushed the topic-span branch 8 times, most recently from 9c830a3 to 5586e73 on January 22, 2021
@ax3l self-requested a review on January 22, 2021
@franzpoeschel force-pushed the topic-span branch 2 times, most recently from c823946 to d3e8f15 on February 22, 2021
@franzpoeschel (Contributor, Author) commented on Feb 23, 2021

I've benchmarked this for a PIConGPU simulation that dumped 124 GB of data in 4 IO passes from 1 GPU [1].
Note that PIConGPU already saves memory by reusing store buffers and flushing them to the backend anew for each single dataset, which reduces the improvement attainable with the span-based API.

The current memory usage, as profiled by KDE Heaptrack, peaks at 87.1 GB:
[screenshot: Heaptrack memory profile]
Simulation times for the same run, but without Heaptrack:

PIConGPUVerbose INPUT_OUTPUT(32) | openPMD: IO plugin ran for 32sec 737msec (average: 35sec 126msec)
calculation  simulation time:  3min 43sec 439msec = 223 sec
full simulation time:  4min 54sec 746msec = 294 sec

After using this PR in PIConGPU, it peaks at 85.4 GB, saving roughly 1.7 to 1.8 GB of heap memory:
[screenshot: Heaptrack memory profile]
Simulation times for the same run, but without Heaptrack:

PIConGPUVerbose INPUT_OUTPUT(32) | openPMD: IO plugin ran for 28sec 372msec (average: 31sec 969msec)
calculation  simulation time:  3min 30sec 703msec = 210 sec
full simulation time:  5min 24sec 709msec = 324 sec

So, we saved roughly 3 seconds of memory allocation time per run of the IO plugin (average plugin runtime: 35.1 s before, 32.0 s after).

The remaining memory spikes stem from the DoubleBuffer allocation strategy in the openPMD plugin.
I expect more pronounced differences in our little openpmd-pipe tool; I will come back to that.

[1] KelvinHelmholtz example called with picongpu -s 150 -d 1 1 1 -g 256 512 128 --openPMD.period 50 --openPMD.file dump --openPMD.infix NULL --openPMD.ext bp --openPMD.json '{ "adios2": { "engine": { "usesteps": true, "parameters": { "InitialBufferSize": "34Gb", "Profile": "On" } }, "dataset": { "operators": [ ] } } } ' --versionOnce

@franzpoeschel force-pushed the topic-span branch 2 times, most recently from 1368ef8 to 9db8694 on February 23, 2021
@franzpoeschel force-pushed the topic-span branch 2 times, most recently from 7da46a0 to f8d2366 on March 2, 2021
Resolved review threads: include/openPMD/RecordComponent.hpp (3 threads, 2 outdated), include/openPMD/IO/AbstractIOHandler.hpp
UserFlush,
/**
* Default mode, flush everything that can be flushed
* Does not need to uphold user-level guarantees about clearing and filling
Member:
I don't understand the 2nd sentence yet:
Are you saying that this cannot check the API contract we have with the user, aka that the user must have given us valid buffers and they are ready to use at this point?

Contributor (Author):

I've clarified the documentation now by introducing a concept of flush points. The difference between the two modes is that UserFlush defines a flush point and InternalFlush (renamed from FlushEverything) does not.
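
For reference, the documented distinction might be sketched as follows (the two level names appear in this thread; the doc wording is illustrative):

```cpp
enum class FlushLevel
{
    /** Defines a flush point: user-level guarantees about clearing
     *  and filling buffers must be upheld at this point.
     */
    UserFlush,
    /** Default mode: flush everything that can be flushed.
     *  Defines no flush point, so user-level guarantees about
     *  clearing and filling buffers need not be upheld.
     */
    InternalFlush
};
```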

Comment on lines 3321 to 3471
* Hijack the functor that is called for buffer creation if the
* backend doesn't support the task to see whether the backend
* did support it or not.
Member:
Can you please split this into two sentences? It's a bit hard to parse for me :)

Member:

Should we make this lambda a helper function? Looks like a generally useful implementation that people would use?

Contributor (Author):

> Can you please split this into two sentences? It's a bit hard to parse for me :)

Will do

> Should we make this lambda a helper function? Looks like a generally useful implementation that people would use?

I'm not sure about it... This side-channels the return value (the boolean) through a capture-by-reference. Can we write this in a way that would actually make a good API?
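
For context, the side-channel pattern in question looks roughly like this (a self-contained sketch; all names are illustrative, not the actual implementation):

```cpp
#include <cstddef>
#include <memory>

int main()
{
    // The fallback functor for buffer creation is "hijacked": the backend
    // only invokes it if it cannot provide a buffer view itself, so the
    // functor records that fact through a reference capture.
    bool backendProvidedBuffer = true;
    auto fallbackAllocate = [ &backendProvidedBuffer ]( std::size_t size ) {
        backendProvidedBuffer = false; // the side-channeled boolean
        return std::shared_ptr< void >(
            new char[ size ],
            []( void *ptr ) { delete[] static_cast< char * >( ptr ); } );
    };

    // A backend without buffer-view support would call the functor:
    auto buffer = fallbackAllocate( 1024 );
    // backendProvidedBuffer is now false; with a supporting backend it
    // would have stayed true.
}
```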

Contributor (Author):

> Will do

Done

Resolved review thread: include/openPMD/RecordComponent.hpp (outdated)
@ax3l (Member) commented on Apr 5, 2021

@franzpoeschel can you please rebase this one? :)

@ax3l (Member) left a comment:
Looks already in great shape! Notes inline :)

Resolved review threads: include/openPMD/IO/AbstractIOHandlerImpl.hpp (outdated), docs/source/usage/workflow.rst (3 threads, 2 outdated), examples/12_span_write.cpp, examples/12_span_write.py (outdated)
@@ -1041,7 +1068,9 @@ namespace detail
std::vector< std::unique_ptr< BufferedAction > > m_buffer;
std::map< std::string, BufferedAttributeWrite > m_attributeWrites;
std::vector< BufferedAttributeRead > m_attributeReads;
std::vector< std::unique_ptr< BufferedAction > > m_alreadyEnqueued;
Member:
Should we add further doxygen strings to these member variables?
The number of member variables indicates we are doing quite involved things here :)

Contributor (Author):
Good point, will do

Resolved review threads: include/openPMD/RecordComponent.tpp (outdated), include/openPMD/Span.hpp (outdated)
{
switch( iterationEncoding() )
IOHandler()->m_flushLevel = level;
try
Member:

Here and below: why do we use exception handling at this point and rethrow another exception?
I am just concerned that we absorb legit exceptions from backends, e.g. if a low-level ADIOS/HDF5 operation fails.

The try-catch pattern on this high level looks a bit duct-tape-y to me.

Contributor (Author):

The purpose of this is to emulate what the Python or Java finally keyword does. This does not absorb exceptions from a layer below; the construct does some cleanup and passes the exception on:

> Rethrows the currently handled exception. Abandons the execution of the current catch block and passes control to the next matching exception handler (but not to another catch clause after the same try block: its compound-statement is considered to have been 'exited'), reusing the existing exception object: no new objects are made.

(https://en.cppreference.com/w/cpp/language/throw)

The purpose is to ensure that the base state is restored even if an exception is thrown.
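
Schematically, as a self-contained sketch (the FlushLevel names follow this PR; everything else is illustrative):

```cpp
#include <iostream>
#include <stdexcept>

enum class FlushLevel { InternalFlush, UserFlush };
FlushLevel flushLevel = FlushLevel::InternalFlush;

// Stand-in for backend work that may throw (ADIOS2, HDF5, ...).
void flushImpl() { throw std::runtime_error( "backend failure" ); }

// Emulate `finally`: restore the base state on both the normal and the
// exceptional path; the catch block rethrows the very same exception
// object, so nothing from the backend layer is absorbed.
void flush( FlushLevel level )
{
    flushLevel = level;
    try
    {
        flushImpl();
    }
    catch( ... )
    {
        flushLevel = FlushLevel::InternalFlush; // cleanup
        throw; // rethrow, no new exception object is created
    }
    flushLevel = FlushLevel::InternalFlush; // cleanup on normal return
}

int main()
{
    try
    {
        flush( FlushLevel::UserFlush );
    }
    catch( std::exception const &e )
    {
        std::cout << "propagated: " << e.what() << "\n";
    }
}
```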

Member:

Perfect, thanks!

@ax3l (Member) left a comment:

Thank you for the great PR, this is ready to go! 🚀 ✨
