1. Memory Backends and Allocators

Memory Backend

A memory backend represents a contiguous array of bytes that was allocated in bulk from a memory allocator provided by the Operating System (OS). Typically, a Memory Backend reserves a large amount of virtual memory, since physical memory is not actually committed until a page is first touched. The only limitation is the size of the virtual address space (typically 2^64 bytes).

class MemoryBackend {
public:
  MemoryBackendHeader *header_;
  char *data_;
  size_t data_size_;
  bitfield32_t flags_;
};

We currently provide the following Memory Backends:

  1. Mmap (private): uses the mmap() system call to allocate private, anonymous memory
  2. Mmap (shared): uses the mmap() system call to allocate shared memory
  3. Array: The user inputs an already-allocated buffer. Does not internally allocate any memory.
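
Below is a minimal sketch of creating a shared-memory backend through the memory manager singleton. It reuses the CreateBackend call, the MEGABYTES macro, and the include from the full example at the end of this page; the backend name "backend_example" is a placeholder.

#include "hermes_shm/data_structures/thread_unsafe/list.h"

int main() {
  // Grab the process-wide memory manager singleton
  auto mem_mngr = HERMES_MEMORY_MANAGER;
  // Reserve a 64MB shared-memory region identified by a URL
  mem_mngr->CreateBackend<hipc::PosixShmMmap>(
    MEGABYTES(64), "backend_example");
  return 0;
}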

Memory Allocator

A memory allocator manages a contiguous array of data, typically provided by a Memory Backend. Memory allocators fundamentally provide two interfaces: allocate and free. SHM allocators return offsets into the memory backend instead of raw pointers, since the backend may be mapped at a different virtual address in each process.

class Allocator {
public:
  virtual OffsetPointer Allocate(size_t size);
  virtual void Free(OffsetPointer &p);
};

There are 4 different pointer offset types:

  1. OffsetPointer: stores a 64-bit offset
  2. Pointer: stores a 64-bit offset + allocator ID (64-bit)
  3. AtomicOffsetPointer: stores a 64-bit offset using atomic instructions (guarantees memory coherency)
  4. AtomicPointer: stores the 64-bit offset using atomic instructions, but stores the allocator ID non-atomically
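
As a sketch of the interface above, the function below allocates and frees a region through the OffsetPointer type. The hipc:: qualification on OffsetPointer and the 256-byte size are assumptions for illustration, and alloc can be any allocator mounted in the memory manager (for example, the one created in the full example at the end of this page).

// `alloc` is any allocator obtained from the memory manager
void AllocateAndFree(hipc::Allocator *alloc) {
  // Allocate returns an offset into the backend rather than a raw pointer,
  // so the value is meaningful in every process that maps the backend
  hipc::OffsetPointer p = alloc->Allocate(256);
  // ... use the allocation ...
  alloc->Free(p);
}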

Stack Allocator

The Stack Allocator provides only an allocation function; once a memory region has been allocated, it can never be truly freed again. This allocator is used internally by other allocators and for debugging purposes.

Properties of this allocator:

  1. Thread-safe
  2. Works well in cases where memory doesn't need to be re-used
  3. Not general-purpose; primarily used internally by other allocators
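
A minimal sketch of standing up a StackAllocator follows. It uses the same CreateBackend/CreateAllocator calls as the full example at the end of this page, while the URL "stack_example", the allocator ID, and the custom-header size of 0 are illustrative assumptions.

  auto mem_mngr = HERMES_MEMORY_MANAGER;
  // Back the allocator with a 64MB shared-memory region
  mem_mngr->CreateBackend<hipc::PosixShmMmap>(
    MEGABYTES(64), "stack_example");
  // The full example below passes sizeof(CustomHeader) instead of 0 here
  hipc::Allocator *alloc = mem_mngr->CreateAllocator<hipc::StackAllocator>(
    "stack_example", hipc::allocator_id_t(0, 1), 0);
  // Bump-allocate 1KB; per the description above, the region is never truly reclaimed
  hipc::OffsetPointer p = alloc->Allocate(1024);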

Fixed Page Allocator

The Fixed Page Allocator tracks every unique page size and allows freed pages of those sizes to be reused. Many data structures, such as lists and unordered_maps, require only a few specific page sizes, and caching those sizes benefits both their performance and resource utilization. Benchmarks have shown this allocator to be at least 2x faster than a traditional malloc for workloads that allocate only a few distinct page sizes. However, this allocator does not support coalescing and can result in poor memory utilization when arbitrary page sizes are allocated.

Properties of this allocator:

  1. NOT thread-safe
  2. Works well when there are only a few unique pages being allocated
  3. Not general-purpose; primarily used internally by other allocators which need complex data structures (e.g., unordered_map)
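
The sketch below shows the reuse pattern this allocator targets: the same page size is allocated and freed repeatedly, so freed pages can be cached and handed back. The class name hipc::FixedPageAllocator and the URL "fixed_example" are assumptions for illustration (check the headers for the exact name), and a backend for "fixed_example" is assumed to have been created as in the earlier sketches.

  // NOTE: hipc::FixedPageAllocator is an assumed class name for illustration;
  // a backend for "fixed_example" is assumed to already exist
  hipc::Allocator *alloc = mem_mngr->CreateAllocator<hipc::FixedPageAllocator>(
    "fixed_example", hipc::allocator_id_t(0, 2), 0);
  // Repeatedly allocating and freeing one page size lets freed pages be reused
  for (int i = 0; i < 1024; ++i) {
    hipc::OffsetPointer p = alloc->Allocate(64);
    alloc->Free(p);
  }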

Scalable Page Allocator

The Scalable Page Allocator caches a few specific page sizes; the cached sizes are configurable. Memory is initially divided evenly among cores. To support better coalescing, small pages are allocated in larger groups; e.g., a 64-byte page allocation triggers a 4KB allocation. Coalescing is performed when a certain percentage of memory is free but an allocation fails to find a suitably sized page.

Properties of this allocator:

  1. Thread-safe
  2. Works well in workloads which allocate near the size of the cached pages
  3. General-purpose; should get reasonable memory utilization and performance in many workloads
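
As with the previous sketch, the class name hipc::ScalablePageAllocator and the URL "scalable_example" are assumptions for illustration; the point is that a general-purpose, thread-safe allocator is created the same way and performs best when allocation sizes fall near the cached page sizes.

  // NOTE: hipc::ScalablePageAllocator is an assumed class name for illustration;
  // a backend for "scalable_example" is assumed to already exist
  hipc::Allocator *alloc = mem_mngr->CreateAllocator<hipc::ScalablePageAllocator>(
    "scalable_example", hipc::allocator_id_t(0, 3), 0);
  // Allocations near the cached page sizes perform best
  hipc::OffsetPointer small = alloc->Allocate(64);
  hipc::OffsetPointer large = alloc->Allocate(MEGABYTES(1));
  alloc->Free(small);
  alloc->Free(large);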

Creating a Backend and Allocator

The following example is in example/allocator.cc

#include <mpi.h>
#include <cassert>
#include <iostream>
#include "hermes_shm/data_structures/thread_unsafe/list.h"

struct CustomHeader {
  int data_;
};

int main(int argc, char **argv) {
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Common allocator information
  std::string shm_url = "test_allocators";
  hipc::allocator_id_t alloc_id(0, 1);
  auto mem_mngr = HERMES_MEMORY_MANAGER;
  hipc::Allocator *alloc;
  CustomHeader *header;

  // Create backend + allocator
  if (rank == 0) {
    // Create a 64 megabyte allocatable region
    mem_mngr->CreateBackend<hipc::PosixShmMmap>(
      MEGABYTES(64), shm_url);
    // Create a memory allocator over the 64MB region
    alloc = mem_mngr->CreateAllocator<hipc::StackAllocator>(
      shm_url, alloc_id, sizeof(CustomHeader));
    // Get the custom header from the allocator
    header = alloc->GetCustomHeader<CustomHeader>();
    // Set custom header to 10
    header->data_ = 10;
  }
  MPI_Barrier(MPI_COMM_WORLD);

  // Attach backend + find allocator
  if (rank != 0) {
    mem_mngr->AttachBackend(hipc::MemoryBackendType::kPosixShmMmap, shm_url);
    alloc = mem_mngr->GetAllocator(alloc_id);
    header = alloc->GetCustomHeader<CustomHeader>();
  }
  MPI_Barrier(MPI_COMM_WORLD);

  // Verify header is equal to 10 in all processes
  assert(header->data_ == 10);

  // Finalize
  if (rank == 0) {
    std::cout << "COMPLETE!" << std::endl;
  }
  MPI_Finalize();
}

Let's walk through the above code step-by-step.

1. Initialize MPI

First, we initialize MPI and determine the "rank" of the current process.

int main(int argc, char **argv) {
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

Each process is assigned a unique integer (its rank) between 0 and the number of processes minus one. The process with rank 0 is commonly referred to as the rank-0 or root process.

2. Define constants shared among processes

The following code defines information that is common to all processes.

  // Common allocator information
  std::string shm_url = "test_allocators";
  hipc::allocator_id_t alloc_id(0, 1);
  auto mem_mngr = HERMES_MEMORY_MANAGER;
  hipc::Allocator *alloc;
  CustomHeader *header;

The shm_url determines the location of the shared memory. The allocator ID must also be the same across all processes.

3. Initialize the memory allocator + backend on the Rank-0 Process

The following block of code initializes the shared memory allocator + backend in the rank-0 process.

  if (rank == 0) {
    // Create a 64 megabyte allocatable region
    mem_mngr->CreateBackend<hipc::PosixShmMmap>(
      MEGABYTES(64), shm_url);
    // Create a memory allocator over the 64MB region
    alloc = mem_mngr->CreateAllocator<hipc::StackAllocator>(
      shm_url, alloc_id, sizeof(CustomHeader));
    // Get the custom header from the allocator
    header = alloc->GetCustomHeader<CustomHeader>();
    // Set custom header to 10
    header->data_ = 10;
  }
  MPI_Barrier(MPI_COMM_WORLD);

By checking whether rank == 0, this block is executed only on the rank-0 process, which avoids accidentally repeating initialization in every process. In this example, we create a 64MB shared-memory backend and a StackAllocator to manage its memory. The MPI_Barrier ensures that all processes wait until the shared memory has been initialized.

4. Attach to the backend in all other ranks

After the memory has been initialized, all other processes (i.e., ranks) attach to the shared-memory backend. This internally discovers the allocator managing the backend and mounts it in the HERMES_MEMORY_MANAGER.

  if (rank != 0) {
    mem_mngr->AttachBackend(hipc::MemoryBackendType::kPosixShmMmap, shm_url);
    alloc = mem_mngr->GetAllocator(alloc_id);
    header = alloc->GetCustomHeader<CustomHeader>();
  }
  MPI_Barrier(MPI_COMM_WORLD);

5. Remaining code

The remaining code in the example is just error checking and finalization routines.