updates to docs, new section on memory management #3120

Merged
merged 1 commit into from
Mar 22, 2022
30 changes: 30 additions & 0 deletions docs/user_guide/source/advanced/memory_management.rst
@@ -0,0 +1,30 @@
###################
Memory Management
###################

BP4 buffering
-------------

BP4 has a simple buffering mechanism that trades high memory usage for maximum performance: all user data (passed in `Put()` calls) is buffered in one contiguous memory allocation, and writing/aggregation is done from this large buffer in `EndStep()`. Aggregation in BP4 uses MPI to send this buffer to the aggregator, and hence two such large buffers are maintained. In short, if an application writes `N` bytes of data in a step, BP4 needs approximately `2xN` bytes of extra memory for buffering.

A potential performance problem is that BP4 needs to extend the buffer occasionally to fit more incoming data (more `Put()` calls). At large sizes, the reallocation may need to move the buffer to a different place in memory, which requires copying the entire existing buffer. When there are already GBs of data buffered, this copy noticeably decreases the overall observed write performance. This situation can be avoided if one can guess a usable upper limit on how much data each process is going to write, and tell this to the BP4 engine through the **InitialBufferSize** parameter before `Open()`.
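A minimal sketch of presizing the BP4 buffer (the IO name, file name and the 4Gb value are illustrative; it assumes ADIOS2 and MPI have been initialized):

.. code-block:: c++

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("Output");
    io.SetEngine("BP4");

    // Pre-size the buffer to the expected per-process output so that
    // EndStep() never has to reallocate (and copy) gigabytes of data
    io.SetParameter("InitialBufferSize", "4Gb");

    adios2::Engine writer = io.Open("output.bp", adios2::Mode::Write);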

Another potential problem is that reallocation may fail at some point, well before the limits of memory are reached, since it requires a single contiguous allocation of that size to be available.

BP5 buffering
-------------

BP5 is designed to use less memory than BP4. The buffer it manages is a list of large chunks. The advantages of the chunk list are that no reallocation of the existing buffer is ever needed, and that BP5 can potentially allocate more buffer space than BP4 because it requests many smaller chunks instead of one large contiguous buffer. In general, chunks should be as big as the system/application can afford, up to **2147381248** bytes (slightly less than 2GB, the actual size limit of POSIX write() calls). Each chunk results in a separate write call, hence minimizing the number of chunks is preferred. The current default is 128MB, so on large computers please increase this, via the **BufferChunkSize** parameter, if you can and if you write more than that amount of data per process.
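A minimal sketch of selecting BP5 and increasing the chunk size (the 1Gb value is illustrative; the parameter name and the ~2GB ceiling are the ones documented above):

.. code-block:: c++

    adios2::IO io = adios.DeclareIO("Output");
    io.SetEngine("BP5");

    // Fewer, larger chunks mean fewer write system calls;
    // stay below the ~2GB POSIX write limit mentioned above
    io.SetParameter("BufferChunkSize", "1Gb");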

Second, BP5 can add a large user variable as a chunk to this list without copying it at all and use it directly for writing (or for sending to the aggregator). `Put(..., adios2::Mode::Deferred)` handles the user data this way, unless its size is below a threshold (see parameter **MinDeferredSize**).
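A minimal sketch of such a zero-copy write (`writer`, `var` and `localSize` are assumed from earlier setup; the user buffer must stay valid until `EndStep()`):

.. code-block:: c++

    std::vector<double> data(localSize);
    // ... fill data ...

    writer.BeginStep();
    // Deferred mode lets BP5 keep a pointer to the user buffer and write
    // from it directly, provided the buffer is larger than MinDeferredSize
    writer.Put(var, data.data(), adios2::Mode::Deferred);
    writer.EndStep(); // data is consumed here; do not call PerformPuts()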

.. note::
Do not call `PerformPuts()` when using BP5, because this call forces copying all user data into the internal buffer before writing, eliminating all benefits of zero-copy that BP5 provides.

Third, BP5 uses a shared-memory segment on each compute node for aggregation, instead of MPI. The best setting for the shared memory is 4GB (see parameter **MaxShmSize**), which is enough room for two chunks at the POSIX write limit. More is useless; it can be smaller if a system/application cannot afford that much space for aggregation (but there will be more write calls to disk as a result).
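A minimal sketch of tuning the aggregation shared memory on the same `io` object (the 4Gb value follows the recommendation above):

.. code-block:: c++

    // Room for two maximum-size chunks per node, as recommended above
    io.SetParameter("MaxShmSize", "4Gb");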

Span object in internal buffer
------------------------------

Another option to decrease memory consumption is to pre-allocate space in the BP4/BP5 buffer and then prepare output variables directly in that space. This avoids a copy and the need to double memory for temporary variables that are created only for output purposes. This Span feature is only available in C++.
See the `Span()` function in the :ref:`C++11 Engine class`.
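A minimal sketch of filling a variable in place through a Span (`writer` and `varT` are assumed from earlier setup; `ComputeValue` is an illustrative placeholder for the application's own computation):

.. code-block:: c++

    writer.BeginStep();

    // Reserve space for varT directly inside the engine's own buffer;
    // no separate user-side array and no extra copy are needed
    adios2::Variable<double>::Span span = writer.Put(varT);

    for (size_t i = 0; i < span.size(); ++i)
    {
        span[i] = ComputeValue(i); // fill the output in place
    }

    writer.EndStep();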
5 changes: 0 additions & 5 deletions docs/user_guide/source/advanced/reduction.rst

This file was deleted.

2 changes: 2 additions & 0 deletions docs/user_guide/source/api_full/cxx11.rst
@@ -71,6 +71,8 @@ The following section provides a summary of the available functionality for each
:members:


.. _C++11 Engine class:

:ref:`Engine` class
-------------------

2 changes: 1 addition & 1 deletion docs/user_guide/source/engines/bp5.rst
@@ -78,7 +78,7 @@ This engine allows the user to fine tune the buffering operations through the following parameters:

#. **BufferVType**: *chunk* or *malloc*, the default is chunking. Chunking maintains the buffer as a list of memory blocks: ADIOS-owned blocks for sync-ed and small Puts, and user-owned pointers for deferred Puts. Malloc maintains a single memory block and extends it (reallocates) whenever more data is buffered. Chunking incurs extra cost in I/O by having to write data in chunks (multiple write system calls), which can be helped by increasing *BufferChunkSize* and *MinDeferredSize*. Malloc incurs extra cost by reallocating memory whenever more data is buffered (by Put()), which can be helped by increasing *InitialBufferSize*.

#. **BufferChunkSize**: (for *chunk* buffer type) The size of each memory buffer chunk, default is 128MB but it is worth increasing up to 2GB if possible for maximum write performance.
#. **BufferChunkSize**: (for *chunk* buffer type) The size of each memory buffer chunk, default is 128MB but it is worth increasing up to 2147381248 (a bit less than 2GB) if possible for maximum write performance.

#. **MinDeferredSize**: (for *chunk* buffer type) User variables smaller than this size are always buffered (copied); the default is 4MB.

2 changes: 1 addition & 1 deletion docs/user_guide/source/index.rst
@@ -34,7 +34,7 @@ Funded by the `Exascale Computing Project (ECP) <https://www.exascaleproject.org
:caption: Advanced Topics

advanced/aggregation
advanced/reduction
advanced/memory_management
advanced/gpu_aware
advanced/plugins

6 changes: 4 additions & 2 deletions docs/user_guide/source/introduction/whatsnew.rst
@@ -10,12 +10,14 @@ Important changes to the API
fails outside BeginStep/EndStep sections. You need to modify your Open() statement to use the random-access mode if your
code is accessing all steps in an existing file.
* **adios2::ADIOS::EnterComputationBlock()**, **adios2::ADIOS::ExitComputationBlock()** are hints to ADIOS that a process is in a computing (i.e. non-communicating) phase. BP5 asynchronous I/O operations can schedule writing during such phases to avoid interfering with the application's own communication.
* GPU-aware I/O supports passing device-memory data pointers to the ADIOS2 Put() function, and ADIOS2 will automatically download data from the device during I/O. Alternatively, an extra member function of the Variable class, **SetMemorySpace(const adios2::MemorySpace mem)** can explicitly tell ADIOS2 whether the pointer points to device memory or host memory.
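  For example, a minimal sketch of the explicit form (the device pointer `d_data` and its prior CUDA allocation are assumed; the memory-space value shown is the CUDA one available in this release):

  .. code-block:: c++

      // d_data points to device memory, e.g. allocated with cudaMalloc.
      // SetMemorySpace() removes any ambiguity about where the pointer lives;
      // ADIOS2 then downloads the data from the device during I/O
      var.SetMemorySpace(adios2::MemorySpace::CUDA);
      writer.Put(var, d_data);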

New features

* BP5 data format and engine. This new engine optimizes for many variables and many steps at large scale.
It is also more memory efficient than previous engines.
* Plugin architecture to support external *engines* and *operators* outside the ADIOS2 installation.
It is also more memory efficient than previous engines, see :ref:`BP5`.
* Plugin architecture to support external *engines* and *operators* outside the ADIOS2 installation, see :ref:`Plugins`
* GPU-Aware I/O for writing data from device memory, using CUDA (NVidia GPUs only), see :ref:`GPU-aware I/O`

Other changes
