Component Overview
These notes are meant to provide a brief summary of ASGarD’s major code components. The goal in writing these is to document the code, but also to provide a place where math/physics professionals can add their insight to improve the team’s overall understanding of how the code works. So, this is not a closed writeup – please use your expertise to add detail and fix misconceptions.
“Adaptivity” describes ASGarD’s capability to add/remove element grid points as needed to more efficiently and accurately model a solution. Elements responsible for relatively large coefficients in the solution may be “refined” (have child elements added to the grid). Elements whose coefficients are very small may be “coarsened” (removed from the grid). The thresholds for coarsening and refinement are controlled by an input argument to ASGarD (`--thresh`).
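As a rough illustration of the thresholding idea, here is a minimal sketch (the exact criteria, and the coarsening fraction used below, are assumptions rather than ASGarD’s literal rules):

```cpp
// Minimal sketch of relative thresholding for adaptivity. Assumed, not taken
// from ASGarD: an element is refined when its largest coefficient is a
// significant fraction of the global maximum, and coarsened when it falls
// below a (hypothetical) fraction of that refinement threshold.
#include <cmath>

bool should_refine(double element_max_coeff, double global_max_coeff,
                   double thresh) // value passed via --thresh
{
  return std::abs(element_max_coeff) >= thresh * std::abs(global_max_coeff);
}

bool should_coarsen(double element_max_coeff, double global_max_coeff,
                    double thresh)
{
  double const coarsen_fraction = 0.1; // hypothetical fraction
  return std::abs(element_max_coeff) <
         coarsen_fraction * thresh * std::abs(global_max_coeff);
}
```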
- class `distributed_grid`
The `distributed_grid` is composed of the element table (from the `elements` component) and the `distribution_plan` (`distribution` component) that maps each MPI rank in the simulation to the elements for which the rank must compute coefficients. In short, this class bundles information about the elements and how they are distributed, and provides functions to adapt the grid given the solution vector with elements’ coefficient values.
The main routines driving this functionality (`distributed_grid::refine` and `distributed_grid::coarsen`) refine/coarsen the element grid by adding/deleting elements from its member `elements::table`, and then redistributing responsibility for the new element grid across ranks using functions in the `distribution` component.
- It is reasonable to encapsulate the element table with the distribution plan, since the adaptivity functions require altering the grid and rebalancing the problem across ranks, but providing `const &` getters for these members is not ideal. It would be better to provide a stricter interface allowing outside components to access the functionality they need without exposing (for read) the objects themselves.
- Currently, there is no limit (beyond a level cap) on refinement/coarsening. This introduces two related problems: refinement to the extent that some rank’s portion of the solution vector spills out of RAM (virtually unlimited growth), and coarsening to the extent that fewer elements remain than ranks. We will need to gracefully handle both situations.
Most computations within ASGarD are performed in wavelet space, but the PDEs we want to solve are defined in real space. So, we require a means of transforming vectors/matrices into wavelet space. Most of this functionality is provided in the `transformations` component, but the `basis` component handles the operator required for transformations. Originally, this operator was explicitly formed (as a matrix) and applied via matrix multiplication as appropriate to transform values. However, because the transform matrix is sparse with large zero regions and a regular pattern of dense blocks, we now form and store only the dense blocks. Functions are provided in the `basis` component to apply the transform matrix using these blocks, skipping the unnecessary arithmetic the large zero regions would introduce.
- function `operator_two_scale`
Deprecated. Explicitly forms the transform matrix. Uses `generate_multi_wavelets`, which is conceptually private but exposed in the interface for testing purposes.
- class `wavelet_transform`
The constructor forms the dense blocks of the transform matrix. The `apply` methods provide the result of matrix multiplication using these dense blocks.
The dense blocks required for the transform matrix are relatively small and their formation is not costly. So, rather than constructing one instance of this class and passing it around for transforms as needed, we could create instances as needed at function scope to declutter APIs requiring `wavelet_transform` parameters.
Both the deprecated `operator_two_scale` and `generate_multi_wavelets` functions are lengthy and obscure in places. The C++ developers are not necessarily familiar with the underlying math performed by these functions; this makes optimization or clarification of the MATLAB code that served as a template for these routines a challenge. There is an opportunity for someone with math insight to clarify the code here.
“Kronmult” (a Kronecker product of matrices followed by matrix-vector multiplication, `Ax = b`, where `A` is a tensor product) is one of the core computational kernels in ASGarD. We often avoid explicit formation of the Kronecker product by instead multiplying each Kronecker operand (in turn) directly against the vector `x`, conceptually reshaped as a small matrix. High-dimensional/high-resolution problems may require `O(10^6)` or more small kronmults to move the simulation forward at each timestep.
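As a concrete illustration of this reshaping, here is a minimal 2D sketch (not ASGarD’s kernel, which lives in the `kronmult`/`kronmult_cuda` components and handles arbitrary dimensions, batching, and GPUs). It uses the identity `(A kron B) * vec(X) = vec(B * X * A^T)` for column-major `vec`:

```cpp
// 2D kronmult sketch: compute y = (A kron B) * x without forming the
// (na*nb) x (na*nb) Kronecker product. A is na x na, B is nb x nb, and
// x is the column-major vec of an nb x na matrix X.
#include <cstddef>
#include <vector>

std::vector<double> kronmult_2d(std::vector<double> const &A, std::size_t na,
                                std::vector<double> const &B, std::size_t nb,
                                std::vector<double> const &x)
{
  // T = B * X, where X(k, j) = x[k + j*nb]
  std::vector<double> T(nb * na, 0.0);
  for (std::size_t j = 0; j < na; ++j)
    for (std::size_t k = 0; k < nb; ++k)
      for (std::size_t i = 0; i < nb; ++i)
        T[i + j * nb] += B[i + k * nb] * x[k + j * nb];

  // Y = T * A^T; the result y is vec(Y)
  std::vector<double> y(nb * na, 0.0);
  for (std::size_t j = 0; j < na; ++j)
    for (std::size_t k = 0; k < na; ++k)
      for (std::size_t i = 0; i < nb; ++i)
        y[i + j * nb] += T[i + k * nb] * A[j + k * na];
  return y;
}
```

For a d-dimensional kronmult the same idea is applied one operand at a time, so the cost stays proportional to a handful of small dense matrix products rather than the size of the full Kronecker product.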
A prior implementation of the time stepping code performed these many kronmults by constructing dimension-many lists (batches) of matrix-multiplication operands and invoking vendor-supplied `batched_gemm` routines. The `batch` component was designed to support constructing, storing, and passing pointer lists for `batched_gemm`.
The `batch` component is no longer used to drive explicit timestepping in `time_advance`; the `kronmult` component now serves this purpose. However, some functions within the component are used for implicit timestepping, and the “kronmult as batched gemm” technique is employed for wavelet to realspace transforms in the `transformations` component.
- class `batch`
Provides storage for a fixed-size list of pointers to same-sized matrices for `batched_gemm`. The `assign_entry` interface stores a pointer to an `fk::matrix` with asserts/size checking; `assign_raw` stores a pointer without such checks. `assign_raw` was added after performance measurements showed that creating/destroying many matrix views can become costly (mostly due to reference counting using smart pointers in the `fk::matrix` class).
- class `batch_chain`
Given the operands for a kronmult (a list of matrices and a vector), as well as some externally allocated temporary workspace, the `batch_chain` class internally constructs `batch` members for `batched_gemm`. The `execute` member actually performs the multiplications by calling into the `lib_dispatch` component. Currently only used for the real space transform.
- functions `batched_gemm`/`batched_gemv`
Thin layer over the `lib_dispatch` functions with the same names; `lib_dispatch` accepts BLAS-style arguments, so these functions translate/set up for those calls.
- function `build_system_matrix`
The implicit time advance is driven by solving the system matrix. Given the `elements::table` describing which elements are active, as well as the 1D coefficient matrices from the PDE class, this function explicitly constructs the system matrix.
This entire component is likely to be removed in the future. `build_system_matrix` belongs in the `time_advance` component or elsewhere; the `batch_chain`/`batch` classes are only used for the wavelet to real space transforms in `transformations`. The current implementation of this transform is a proof of concept that does not scale well and is not distributed across nodes.
Some problems require the application (by addition) of boundary condition vectors to the solution vector at every timestep. This component a) generates unscaled boundary condition vectors and b) provides functions that scale the boundary condition vectors for the current time. The mathematical/physical significance of these vectors has not generally been understood on the C++ development side, but the above description demonstrates the most salient features of these vectors for C++ development: boundary condition vectors are tied to the current problem and grid (and so must be regenerated after adapting the grid), and must be scaled for time before addition to the solution vector. These features are shared with source vectors, although the implementations are different.
- function `make_unscaled_bc_parts`
Given a PDE and grid, returns a pair of “unscaled_bc_parts” representing the left and right unscaled boundary condition vectors. The start and stop element integer parameters restrict generation to part of the vectors; specifically, the part assigned to a rank in the distribution plan.
- function `generate_scaled_bc`
Accepts the unscaled boundary conditions and current time as input, and scales the vectors appropriately.
This component has a strange interface – an unscaled boundary condition vector is represented by a vector of vectors of vectors of `fk::vector`s. This three-level nesting is the result of each partial term in each term in each dimension contributing to the solution. However, we already have that exact structure in the PDE class – the PDE contains dimensions and terms that contain partial terms. Rather than existing in an isolated component, each partial term should be responsible for producing its own unscaled part, and the PDE should be responsible for providing the boundary condition vector. Housing source/boundary condition vector storage and generation in the PDE class makes sense logically, would declutter the API, and would provide an opportunity to refactor these functions (which could stand a rewrite). The creation and application of source vectors is somewhat similar, and should probably also be consolidated in the PDE class.
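To make the nesting concrete, an illustrative type alias (the ordering of the three levels here is an assumption, not taken from the code):

```cpp
// Illustrative only: one plausible layout of the three-level nesting described
// above; ASGarD's actual term / partial-term / dimension ordering may differ.
#include <vector>
#include "tensors.hpp" // assumed header providing fk::vector

template<typename P>
using unscaled_bc_parts =
    std::vector<                          // one entry per term
        std::vector<                      // one entry per partial term
            std::vector<fk::vector<P>>>>; // one entry per dimension
```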
Each PDE defines dimension-many coefficient matrices (sometimes called “operator matrices”) for each partial term. These partial term matrices are chained together by matrix multiplication to form a single coefficient matrix for each term and dimension. These matrices are square and dense, with `2^level * degree` rows and columns.
Information from `elements::table` is used to draw values from these matrices for either 1) building the system matrix for implicit time advance or 2) performing system matrix/state vector multiplication for explicit time advance (or, in the future, implicit time advance driven by an iterative solve). Specifically, each element in the table contains a series of coordinates that designate, for each dimension, where that element should index into the coefficient matrices.
Currently, all coefficients are stored from construction onward on the GPU (if ASGarD is built to use GPUs). Only time-independent coefficients are supported at present, but we expect to support time-dependent coefficients as well.
- function `generate_all_coefficients`
Takes an input/output `PDE<P>` argument and fills its coefficient matrices by generating all partial term coefficient matrices for each term and dimension and chaining them together by multiplication.
- function `generate_coefficients`
Generates one partial term coefficient matrix. Invoked from `generate_all_coefficients`.
A PDE’s coefficients (at least, the time-independent ones) should be constructed when the PDE is built (RAII). There is no great reason this is not already done. Aside from that, coefficient matrix generation and update functions could easily (and probably should) be member functions of the PDE class. This rework could take place when the planned overhaul of the `PDE` class occurs.
The distribution component handles communication between MPI ranks when ASGarD is run with multiple ranks (nodes/GPUs). ASGarD expects one rank per GPU when built to use GPUs. To simplify problem decomposition, we also currently require an even or perfect square number of ranks. If an odd/nonsquare number of ranks is used, ASGarD will detect this condition and “round down” to the nearest valid number of ranks. Note ASGarD’s distribution only handles the explicit time advance case for now; running implicit time advance with multiple ranks will produce an error message and early exit.
ASGarD decomposes the problem across ranks by decomposing the element grid (see the `elements` section), since each element connection contains the information needed to produce a partial solution. We divide the grid into rectangular subgrids as “squarely” as possible given the number of ranks ASGarD is run with (see the `get_num_subgrid_cols` function). We tile these subgrids in row-major order over the element grid, assigning the element connections within each tile to a single rank. The last subgrid in each tile row may be smaller than the others, as the number of elements may not divide the tile size evenly.
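One plausible way to pick a near-square decomposition is sketched below; this is illustrative only and not necessarily the rule implemented in `get_num_subgrid_cols`:

```cpp
// Illustrative sketch: choose the largest divisor of the rank count that does
// not exceed its square root as the number of subgrid columns, giving at least
// as many tile rows as tile columns.
#include <cmath>

int near_square_num_cols(int num_ranks)
{
  int cols = static_cast<int>(std::floor(std::sqrt(num_ranks)));
  while (num_ranks % cols != 0)
    --cols;            // always terminates, since 1 divides num_ranks
  return cols;         // rows = num_ranks / cols >= cols
}
```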
The current implementation will produce tiles containing at least as many tile rows as tile columns (i.e., there will be at least as many vertical lines as horizontal lines dividing the element grid). All subgrids within a tile row will write (additively) to coefficients in the same region of `y`. So, each tile row must either update `y` atomically (current approach) or write separately and reduce into `y` (previous approach). All subgrids within a tile column will read from the same segment of `x`.
Two major interprocess communication tasks are required by ASGarD and provided by this component – 1) decomposing/distributing the solution vector and 2) distributing changes to `elements::table` after element grid adaptation. The first task is accomplished by a balanced pairing of senders with receivers, where senders write to the region of `y` that receivers need for the next timestep’s `x`. Once this pairing is complete, point-to-point communication is used to exchange vector segments. The second task is accomplished by all-to-all communication to form consensus on the new element table, followed by a redistribution of `x` such that all ranks have the information they need for the next timestep. Task 1 code was implemented first, followed by task 2. Efforts were made to consolidate/share code between these two tasks, but some duplication exists as a result of dissatisfaction with the task 1 implementation.
- helper struct `node_out`
Allows printing a message once per node, so that output isn’t clobbered by every local rank’s output.
- struct `message`
Simple data object representing an MPI point-to-point message that should be sent. Contains fields for the direction (send/receive) and a range of elements whose coefficients should be transferred.
- function `get_plan`
Generates a map from each rank number (0, ..., n-1 for n ranks) to a subgrid representing all the element connections assigned to that rank. Invokes `get_subgrid` for each rank.
- function `reduce_results`
Deprecated now that we use atomic updates into `y`. Previously, each rank within a tile row would use this function to reduce `y` by gathering each row member’s contribution.
- function `exchange_results`
Takes as input `source` (`y`) and `dest` (`x_next`) for each rank, as well as the distribution plan. Partners ranks and transfers data using MPI from senders’ `source` vectors to receivers’ `dest` vectors to ensure all ranks have the correct `x_next` for the next timestep.
- function `gather_errors`
Gathers RMSE figures from all ranks on a node, for display after each timestep.
- function `gather_result`
Gathers all coefficients into a single vector. Note we use a `std::vector` due to its 64-bit size field; our `fk::vector`s are currently 32-bit indexed.
- function `get_global_max`
The adaptivity threshold is based on the maximum coefficient value; this function finds that maximum across all ranks.
- function `distribute_table_changes`
Shares changes to the element table across ranks.
- function `redistribute_vector`
After consolidating the new element table, redistributes `x` so that all ranks have coefficients (or zeros in the case of refined elements) for the next timestep. The old plan and new plan sizes should match; assumes elements are only appended to the list of active elements, or deleted with remaining elements shifted left to fill the deleted regions.
- The helper functions for task 1 are unnecessarily complex (over-engineered). Lots of simplification is needed in the pairing/partnering code.
- Not in a namespace like the newer components. Should be placed in a `distribution::` namespace.
- MPI performance has not been an emphasis for us, as we have focused on adding features and node-local performance. Performance evaluation is needed.
Each of the “elements” represented/managed by this component has three primary properties (a conceptual sketch follows this list):
- an arbitrary but consistent index in the list of active elements. This determines where each element `i` reads from the system state vector `x` (`x_i`) and writes to the next state `y` (`y_i`).
- coordinates that determine where information from operators (coefficient matrices) should be retrieved to apply to `x_i` to yield `y_i`. Each element has dimension-many {level, cell} coordinates.
- a unique integral id used to identify elements.
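A conceptual sketch of these three properties (not ASGarD’s actual data layout), for a d-dimensional problem:

```cpp
// Conceptual only: the three element properties described above.
#include <cstdint>
#include <vector>

struct element_info
{
  int64_t table_index;     // position among active elements; selects x_i and y_i
  std::vector<int> levels; // one hierarchical level per dimension
  std::vector<int> cells;  // one cell per dimension (paired with levels)
  int64_t id;              // unique integral identifier
};
```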
We assume full element connectivity, so for each pair of elements, a connection exists that represents some piece of the work that must be done to form the system matrix (implicit time advance) or apply the system matrix to the state vector (`Ax`, explicit time advance) to advance time. The `elements` component contains helper functions for working with elements, as well as the `elements::table` object that manages the list of active elements, their ordering, and their coordinates.
In short, each element connection is a unit of work that must be done to advance time. The part of the system matrix the element connection requires is row-indexed by the element’s ordering and column-indexed by the connected element’s ordering. Consequently, `x_i` is indexed by the connected element’s ordering, and `y_i` is indexed by the element’s ordering. Elements contain coordinates that locate submatrices within the single-dimensional coefficient matrices, which are required to advance from `x_i` to `y_i` when the system matrix is not explicitly formed. See the `kronmult` component.
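A minimal sketch of how these orderings index the state vectors (not ASGarD code; the contiguous per-element block size, `degree^ndims` in ASGarD, and the callback are stand-ins):

```cpp
// For each connection (row element i, column element j), read from element j's
// segment of x and accumulate into element i's segment of y.
#include <cstdint>
#include <vector>

using connection_kernel = void (*)(double const *x_j, double *y_i, int64_t n);

void apply_connections(std::vector<double> const &x, std::vector<double> &y,
                       int64_t num_elements, int64_t block_size,
                       connection_kernel kernel) // stand-in for a kronmult
{
  for (int64_t i = 0; i < num_elements; ++i)   // element ordering (rows)
    for (int64_t j = 0; j < num_elements; ++j) // connected element (columns)
      kernel(x.data() + j * block_size,  // x_j
             y.data() + i * block_size,  // y_i (accumulated into)
             block_size);
}
```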
- class `elements::table`
Given a PDE and the number of hierarchical basis levels/basis degree, the `elements::table` constructor applies a sparsity rule to build a list of active elements. This class also includes functions to enable/disable elements when the element grid is adapted (see the adaptivity component). A `get_child_elements` function is provided to assist with refinement.
Element coordinates can be retrieved with the `get_coords` member function, but a flattened version of the element table (with all active coordinates stored contiguously in memory as ordered by the table) is also prepared to assist with on-device kronmult.
- functions `map_to_id` and `map_to_coords`
Translate between an element’s unique integral id and its coordinates.
- functions `get_level_cell` and `get_1d_index`
Translate between a {level, cell} coordinate and a unique id for that coordinate.
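A sketch of one common hierarchical 1D indexing scheme (the actual `get_1d_index` implementation may differ in details): level 0 contributes a single cell, and each level `l > 0` contributes `2^(l-1)` cells starting at index `2^(l-1)`.

```cpp
// Hierarchical {level, cell} -> 1D index sketch; assumed, not ASGarD's code.
#include <cstdint>

int64_t level_cell_to_index(int level, int64_t cell)
{
  if (level == 0)
    return cell; // single cell at the coarsest level
  return (int64_t{1} << (level - 1)) + cell;
}
```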
This is one of the clearer components. One potential improvement is in the relationship between the `elements::table` and `adapt::distributed_grid`. The `distributed_grid` contains an element table member. This makes sense because the `distributed_grid` must add/remove elements from the table during adaptivity. However, the table can be retrieved with a simple getter. It would be better to provide a more limited interface.
This simple component primarily exists as a layer over `lib_dispatch` within the `fm` namespace. The main functions in this component accept vector/matrix class arguments (see the `tensors` component) and invoke `lib_dispatch` to perform BLAS operations. The tensor classes also have member functions that invoke BLAS (e.g., the `fk::matrix * fk::vector` operator invokes `gemv`), but the `fast_math` API allows the user additional flexibility. For example, the `fm::gemm` function allows the user to specify a view or owning tensor argument `C` (the result), whereas the member functions always allocate a new object for `C` (a constraint of returning the result by value).
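A hypothetical usage sketch of this distinction; the header names and the exact `fm::gemm` parameter order/defaults are assumptions here, not a reference for the real signatures:

```cpp
#include "fast_math.hpp" // assumed ASGarD headers
#include "tensors.hpp"

void gemm_example()
{
  int const n = 4;
  fk::matrix<double> const A(n, n), B(n, n);
  fk::matrix<double> C(n, n); // caller-provided result storage (or a view)

  auto const D = A * B; // member operator: always allocates a new matrix
  fm::gemm(A, B, C);    // fast_math: result written into the existing C
}
```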
Some utility math functions may also be placed here when they do not correspond directly to MATLAB functionality – otherwise, those functions are placed in `matlab_utilities`.
- BLAS-like functions (`scal`/`gemv`/`gemm`/`nrm2`/`copy`/`axpy`/etc.)
Perform BLAS operations on tensors.
Could be extended as needed with more functionality – not all BLAS routines or common BLAS extensions are available here.
This very small component wraps the HighFive header-only HDF5 library. Its only current use is writing the initial condition/solution vector to disk. To enable IO, users must 1) build ASGarD with `ASGARD_IO_HIGHFIVE` defined, and 2) select a wavelet and/or realspace write frequency at runtime (see `./asgard --help`).
- function `initialize_output_file`
Initializes an HDF5 dataspace and writes the initial condition to disk.
- function `update_output_file`
Writes the solution vector to disk.
Implemented before component namespaces – functions should exist within an `io` namespace.
Physicists may require additional output information beyond the solution vector, e.g. the adapted grid for a given timestep. This component could be extended to write more information.
Additionally, this component is not designed to work in the distributed case; every local rank will write over its fellow local ranks’ filenames. There is also no provision for writing the final solution gathered from all ranks.
This component provides an interface to `kronmult_cuda`, which houses the functions that handle the core computational kernel in explicit timestepping (and, in the future, iterative implicit timestepping) – conceptually, `y = Ax`, where `A` is the system matrix representing the PDE operators, `x` is the current state vector, and `y` is the next state vector. For these routines, the tensor-encoded `A` is not explicitly formed. Instead, for each element connection, element coordinates are used to retrieve small, dense submatrices {`A_0`, ..., `A_ndims-1`} from each single-dimensional operator matrix (tensor “factor” of `A`). For each element connection (and each term) we compute a “kronmult” (Kronecker product * vector): `y_i = (A_0 kron A_1 kron ... kron A_ndims-1) * x_i`. We do not explicitly form the Kronecker product used to compute `y_i`; the arithmetic is reformulated as a series of small matrix-matrix multiplications.
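Written out for the 2D case (higher dimensions apply the same identity one factor at a time), with `x_i` stored column-major as the matrix `X`:

```latex
% 2D kronmult reformulation: no Kronecker product is formed explicitly;
% y_i is obtained from two small dense matrix-matrix products.
y_i = (A_0 \otimes A_1)\,\operatorname{vec}(X)
    = \operatorname{vec}\!\left(A_1 \, X \, A_0^{\mathsf T}\right),
\qquad x_i = \operatorname{vec}(X)
```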
Kronmults are not actually performed by this component – `kronmult_cuda` handles that – but this component provides the necessary setup routines for problem decomposition and memory allocation.