Component Overview
These notes are meant to provide a brief summary of ASGarD’s major code components. The goal in writing these is to document the code, but also to provide a place where math/physics professionals can add their insight to improve the team’s overall understanding of how the code works. So, this is not a closed writeup – please use your expertise to add detail and fix misconceptions.
“Adaptivity” describes ASGarD’s capability to add/remove element grid points as needed to more efficiently and accurately model a solution. Elements responsible for relatively large coefficients in the solution may be “refined” (have child elements added to the grid). Elements whose coefficients are very small may be “coarsened” (removed from the grid). The thresholds for coarsening and refinement are controlled by an input argument to ASGarD (`--thresh`).
- class `distributed_grid`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/adapt.hpp#L37
The `distributed_grid` is composed of the element table (from the `elements` component) and the `distribution_plan` (`distribution` component) that maps each MPI rank in the simulation to the elements for which the rank must compute coefficients. In short, this class bundles information about the elements and how they are distributed, and provides functions to adapt the grid given the solution vector holding the elements’ coefficient values.
The main routines driving this functionality (`distributed_grid::refine` and `distributed_grid::coarsen`) refine/coarsen the element grid by adding/deleting elements from its member `elements::table`, and then redistributing responsibility for the new element grid across ranks using functions in the `distribution` component.
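To make the `--thresh` mechanics concrete, here is a minimal sketch of relative-threshold element selection. All names are hypothetical and the selection rule is simplified; the real logic lives in `distributed_grid::refine`/`coarsen` in `src/adapt.hpp`.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Select element indices whose largest coefficient magnitude meets or
// exceeds thresh * max|x|; these are candidates for refinement. (A
// coarsening pass would instead collect elements below a lower threshold.)
std::vector<int> elements_to_refine(std::vector<double> const &x,
                                    int coeffs_per_element, double thresh)
{
  double max_abs = 0.0;
  for (double v : x)
    max_abs = std::max(max_abs, std::abs(v));

  std::vector<int> refine_ids;
  int const num_elements = static_cast<int>(x.size()) / coeffs_per_element;
  for (int elem = 0; elem < num_elements; ++elem)
  {
    double elem_max = 0.0;
    for (int i = 0; i < coeffs_per_element; ++i)
      elem_max =
          std::max(elem_max, std::abs(x[elem * coeffs_per_element + i]));
    if (elem_max >= thresh * max_abs) // relative threshold, per --thresh
      refine_ids.push_back(elem);
  }
  return refine_ids;
}
```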
- It is reasonable to encapsulate the element table with the distribution plan, since the adaptivity functions require altering the grid and rebalancing the problem across ranks, but providing `const &` getters for these members is not ideal. It would be better to provide a stricter interface allowing outside components to access the functionality they need without exposing (for read) the objects themselves.
- Currently, there is no limit (beyond a level cap) on refinement/coarsening. This introduces two related problems: refinement to the extent that some rank’s portion of the solution vector spills out of RAM (virtually unlimited growth), and coarsening to the extent that fewer elements remain than ranks. We will need to gracefully handle both situations.
Most computations within ASGarD are performed in wavelet space, but the PDEs we want to solve are defined in real space. So, we require a means of transforming vectors/matrices into wavelet space. Most of this functionality is provided in the `transformations` component, but the `basis` component handles the operator required for transformations. Originally, this operator was explicitly formed (as a matrix), with application via matrix multiplication as appropriate to transform values. However, because the transform matrix is sparse, with large zero regions and a regular pattern of dense blocks, we now form and store only the dense blocks. Functions are provided in the `basis` component to apply the transform matrix using these blocks, skipping the unnecessary arithmetic the large zero regions would introduce.
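The block-wise application amounts to multiplying each stored dense block against the corresponding segment of the input vector. A minimal sketch with hypothetical types (the real code applies the blocks held by `wavelet_transform` level by level):

```cpp
#include <vector>

struct dense_block
{
  int row_start, col_start, rows, cols; // block position in the full matrix
  std::vector<double> data;             // row-major block entries
};

// y += T * x, where T is represented only by its dense blocks.
void apply_blocks(std::vector<dense_block> const &blocks,
                  std::vector<double> const &x, std::vector<double> &y)
{
  for (auto const &b : blocks)
    for (int i = 0; i < b.rows; ++i)
      for (int j = 0; j < b.cols; ++j)
        y[b.row_start + i] += b.data[i * b.cols + j] * x[b.col_start + j];
}
```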
- function `operator_two_scale`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/basis.hpp#L15
Deprecated. Explicitly forms the transform matrix. Uses `generate_multi_wavelets`, which is conceptually private but exposed in the interface for testing purposes.
- class `wavelet_transform`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/basis.hpp#L32
The constructor forms the dense blocks of the transform matrix. `apply` methods provide the result of matrix multiplication using these dense blocks.
The dense blocks required for the transform matrix are relatively small and their formation is not costly. So, rather than constructing one instance of this class and passing it around for transforms as needed, we could create instances as needed at function scope to declutter APIs requiring `wavelet_transform` parameters.
Both the deprecated `operator_two_scale` and `generate_multi_wavelets` functions are lengthy and obscure in places. The C++ developers are not necessarily familiar with the underlying math performed by these functions; this makes optimization or clarification of the MATLAB code that served as a template for these routines a challenge. There is an opportunity for someone with math insight to clarify the code here.
“Kronmult” (Kronecker product of matrices followed by matrix-vector multiplication, `Ax = b`, where `A` is a tensor product) is one of the core computational kernels in ASGarD. We often avoid explicit formation of the Kronecker product by instead multiplying each Kronecker operand (in turn) directly with the vector `x`, conceptually reshaped as a small matrix. High dimensional/high resolution problems may require O(10^6) or more small kronmults to move the simulation forward at each timestep.
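The reshaping trick is standard: repeatedly multiply one Kronecker factor against the vector viewed as a small matrix, permuting between passes. A self-contained sketch for square `n x n` factors, with illustrative names rather than ASGarD's actual API:

```cpp
#include <cassert>
#include <vector>

using matrix = std::vector<double>; // n*n entries, column-major

// y = (A[0] kron A[1] kron ... kron A[d-1]) * x, without forming the
// Kronecker product. Each pass multiplies one factor into x (reshaped
// n x N/n) and cyclically permutes the result via a transposed store.
std::vector<double> kronmult(std::vector<matrix> const &A, int n,
                             std::vector<double> x)
{
  int const d = static_cast<int>(A.size());
  int N = 1;
  for (int i = 0; i < d; ++i)
    N *= n;
  assert(static_cast<int>(x.size()) == N);

  std::vector<double> work(N);
  for (int i = d - 1; i >= 0; --i) // last factor first
  {
    int const cols = N / n;
    // W = A[i] * reshape(x, n, cols); then x = vec(transpose(W))
    for (int c = 0; c < cols; ++c)
      for (int r = 0; r < n; ++r)
      {
        double sum = 0.0;
        for (int k = 0; k < n; ++k)
          sum += A[i][k * n + r] * x[c * n + k]; // A[i](r,k) * X(k,c)
        work[r * cols + c] = sum;                // transposed store
      }
    x.swap(work);
  }
  return x;
}
```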
A prior implementation of the time stepping code performed these many kronmults by constructing dimension-many lists (batches) of matrix-multiplication operands and invoking vendor-supplied `batched_gemm` routines. The `batch` component was designed to support constructing, storing, and passing pointer lists for `batched_gemm`.
The `batch` component is no longer used to drive explicit timestepping in `time_advance`; the `kronmult` component now serves this purpose. However, some functions within the component are used for implicit timestepping, and the “kronmult as batched gemm” technique is employed for wavelet to realspace transforms in the `transformations` component.
- class `batch`
Provides storage for a fixed-size list of pointers to same-sized matrices for `batched_gemm`. The `assign_entry` interface stores a pointer to an `fk::matrix` with asserts/size checking; `assign_raw` stores a pointer without such checks. `assign_raw` was added after performance measurements showed that creating/destroying many matrix views can become costly (mostly due to reference counting in the `fk::matrix` class).
- class `batch_chain`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/batch.hpp#L118
Given the operands for a kronmult (a list of matrices and a vector), as well as some externally allocated temporary workspace, the `batch_chain` class internally constructs `batch` members for `batched_gemm`. The `execute` member actually performs the multiplications by calling into the `lib_dispatch` component. Currently, this class is only used for the real space transform.
- functions `batched_gemm`/`batched_gemv`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/batch.hpp#L71
Thin layer over `lib_dispatch` functions with the same name; `lib_dispatch` accepts BLAS-style arguments, so these functions translate/set up for those calls.
- function `build_system_matrix`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/batch.hpp#L147
The implicit time advance is driven by solving against the system matrix. Given the `elements::table` describing which elements are active, as well as the 1D coefficient matrices from the PDE class, this function explicitly constructs the system matrix.
This entire component is likely to be removed in the future. `build_system_matrix` belongs in the `time_advance` component or elsewhere; the `batch_chain`/`batch` classes are only used for the wavelet to real space transforms in `transformations`. The current implementation for this transform is a proof of concept that does not scale well and is not distributed across nodes.
Some problems require the application (by addition) of boundary condition vectors to the solution vector at every timestep. This component a) generates unscaled boundary condition vectors and b) provides functions that scale the boundary condition vectors for the current time. The mathematical/physical significance of these vectors has not generally been understood on the C++ development side, but the above description demonstrates the most salient features of these vectors for C++ development: boundary condition vectors are tied to the current problem and grid (and so must be regenerated after adapting the grid), and must be scaled for time before addition to the solution vector. These features are shared with source vectors, although the implementations are different.
- function `make_unscaled_bc_parts`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/boundary_conditions.hpp#L15
Given a PDE and grid, returns a pair of “unscaled_bc_parts” representing the left and right unscaled boundary condition vectors. The start and stop element integer parameters restrict generation to part of the vectors; they are used to generate only the portion of the vector assigned to a rank in the distribution plan.
- function `generate_scaled_bc`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/boundary_conditions.hpp#L21
Accepts the unscaled boundary conditions and current time as input, and scales the vectors appropriately.
This component has a strange interface – an unscaled boundary condition vector is represented by a vector of vectors of vectors of `fk::vector`s. This three-level nesting is the result of each partial term in each term in each dimension contributing to the solution. However, we already have that exact structure in the PDE class – the PDE contains dimensions and terms that contain partial terms. Rather than existing in an isolated component, each partial term should be responsible for producing its own unscaled part, and the PDE should be responsible for providing the boundary condition vector. Housing source/boundary condition vector storage and generation in the PDE class makes sense logically, would declutter the API, and would provide an opportunity to refactor these functions (which could stand a rewrite). The creation and application of source vectors is somewhat similar, and should probably also be consolidated in the PDE class.
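The nesting described above can be summarized with a type alias. This is an illustrative stand-in (the real declaration uses `fk::vector` from the `tensors` component):

```cpp
#include <vector>

// Stand-in for fk::vector<P> (see the tensors component).
template<typename P>
using fk_vector = std::vector<P>;

// Three-level nesting, indexed as parts[dimension][term][partial_term].
template<typename P>
using unscaled_bc_parts =
    std::vector<std::vector<std::vector<fk_vector<P>>>>;
```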
Each PDE defines dimension-many coefficient matrices (sometimes called “operator matrices”) for each partial term. These partial term matrices are chained together by matrix multiplication to form a single coefficient matrix for each term and dimension. These matrices are square, dense, and `2^level * degree` in size.
Information from `elements::table` is used to draw values from these matrices for either 1) building the system matrix for implicit time advance, or 2) performing system matrix/state vector multiplication for explicit time advance (or, in future, implicit time advance driven by iterative solve). Specifically, each element in the table contains a series of coordinates that designate, for each dimension, where that element should index into the coefficient matrices.
Currently, all coefficients are stored from construction onward on the GPU (if ASGarD is built to use GPUs). Only time-independent coefficients are supported at present, but we expect to support time-dependent coefficients as well.
- function `generate_all_coefficients`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/coefficients.hpp#L7
Takes an input/output `PDE<P>` argument and fills its coefficient matrices by generating all partial term coefficient matrices for each term and dimension and chaining them together by multiplication.
- function `generate_coefficients`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/coefficients.hpp#L12
Generates one partial term coefficient matrix. Invoked from `generate_all_coefficients`.
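The chaining step is plain repeated matrix multiplication. A minimal sketch with hypothetical types (the real code works on `fk::matrix` objects and uses BLAS via `fast_math`):

```cpp
#include <cstddef>
#include <vector>

struct square_matrix
{
  int n;                    // 2^level * degree rows/columns
  std::vector<double> data; // row-major entries
};

square_matrix multiply(square_matrix const &a, square_matrix const &b)
{
  square_matrix c{a.n, std::vector<double>(a.n * a.n, 0.0)};
  for (int i = 0; i < a.n; ++i)
    for (int k = 0; k < a.n; ++k)
      for (int j = 0; j < a.n; ++j)
        c.data[i * a.n + j] += a.data[i * a.n + k] * b.data[k * a.n + j];
  return c;
}

// term coefficient = p_0 * p_1 * ... * p_{m-1} over the partial terms
square_matrix chain(std::vector<square_matrix> const &partial_terms)
{
  square_matrix result = partial_terms.front();
  for (std::size_t i = 1; i < partial_terms.size(); ++i)
    result = multiply(result, partial_terms[i]);
  return result;
}
```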
A PDE’s coefficients (at least, the time-independent ones) should be constructed when the PDE is built (RAII). There is no great reason this is not already done. Aside from that, coefficient matrix generation and update functions could easily (and probably should) be member functions of the PDE class. This rework could take place when the planned overhaul of the `PDE` class occurs.
Also, we currently only support time-independent coefficients. We should add regeneration of time-dependent coefficients in the time advance loop.
The distribution component handles communication between MPI ranks when ASGarD is run with multiple ranks (nodes/GPUs). ASGarD expects one rank per GPU when built to use GPUs. To simplify problem decomposition, we also currently require an even or perfect square number of ranks. If an odd/nonsquare number of ranks is used, ASGarD will detect this condition and “round down” to the nearest valid number of ranks. Note ASGarD’s distribution only handles the explicit time advance case for now; running implicit time advance with multiple ranks will produce an error message and early exit.
ASGarD decomposes the problem across ranks by decomposing the element grid (see the `elements` section), since each element connection contains the information needed to produce a partial solution. We divide the grid into rectangular subgrids as “squarely” as possible given the number of ranks ASGarD is run with (see the `get_num_subgrid_cols` function). We tile these subgrids in row-major order over the element grid, assigning the element connections within each tile to a single rank. The last subgrid in each tile row may be smaller than the others, since the tile size may not evenly divide the number of elements.
The current implementation will produce tiles containing at least as many tile rows as tile columns (i.e., there will be at least as many vertical lines as there are horizontal lines dividing the element grid). All subgrids within a tile row will write (additively) to coefficients in the same region of `y`. So, each tile row must update `y` atomically (current approach) or write separately and reduce into `y` (previous approach). All subgrids within a tile column will read from the same segment of `x`.
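For intuition, a near-square factorization of the rank count can be found by taking the largest divisor no greater than the square root. This sketch is illustrative only; the real rule lives in `get_num_subgrid_cols`, and which factor counts rows versus columns follows ASGarD's convention, not this sketch:

```cpp
#include <utility>

// Return a near-square factorization {larger, smaller} of num_ranks,
// where larger * smaller == num_ranks.
std::pair<int, int> tile_shape(int num_ranks)
{
  int small_side = 1;
  for (int c = 1; c * c <= num_ranks; ++c)
    if (num_ranks % c == 0)
      small_side = c; // largest divisor <= sqrt(num_ranks)
  return {num_ranks / small_side, small_side};
}
// e.g., tile_shape(6) == {3, 2}; tile_shape(16) == {4, 4}
```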
Two major interprocess communication tasks are required by ASGarD and provided by this component: 1) decomposing/distributing the solution vector, and 2) distributing changes to `elements::table` after element grid adaptation. The first task is accomplished by a balanced pairing of senders with receivers, where senders write to the region of `y` that receivers need for the next timestep's `x`. Once this pairing is complete, point-to-point communication is used to exchange vector segments. The second task is accomplished by all-to-all communication to form consensus on the new element table, followed by a redistribution of `x` such that all ranks have the information they need for the next timestep. Task 1 code was implemented first, followed by task 2. Efforts were made to consolidate/share code between these two tasks, but some duplication exists as a result of dissatisfaction with the task 1 implementation.
- helper struct `node_out`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L165
Allows for printing a message once per node, so that output isn’t clobbered by every local rank’s output.
- struct `message`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L199
Simple data object representing an MPI point-to-point message that should be sent. Its fields give the direction (send/receive) and the range of elements whose coefficients should be transferred.
- function `get_plan`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L187
Generates a map from each rank number (0, ..., n-1 for n ranks) to a subgrid representing all the element connections assigned to that rank. Invokes `get_subgrid` for each rank.
- function `reduce_results`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L233
Deprecated, since we now use atomic updates into `y`. Previously, each rank within a tile row would use this function to reduce `y` by gathering each row member’s contribution.
- function `exchange_results`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L243
Takes as input `source` (`y`) and `dest` (`x_next`) for each rank, as well as the distribution plan. Partners ranks and transfers data using MPI from senders’ `source` vectors to receivers’ `dest` vectors, ensuring all ranks have the correct `x_next` for the next timestep.
- function `gather_errors`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L250
Gathers RMSE figures from all ranks on a node, for display after each timestep.
- function `gather_results`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L255
Gathers all coefficients into a single vector. Note we use a `std::vector` due to its 64-bit size field; our `fk::vector`s are currently 32-bit indexed.
- function `get_global_max`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L272
The adaptivity threshold is based on the maximum coefficient value; this function finds the maximum across all ranks.
- function `distribute_table_changes`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L276
Shares changes to the element table across ranks.
- function `redistribute_vector`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L296
After consolidating the new element table, redistributes `x` so that all ranks have coefficients (or zeros, in the case of refined elements) for the next timestep. Old plan and new plan sizes should match; assumes elements are only appended to the list of active elements, or deleted with remaining elements shifted left to fill deleted regions.
- The helper functions for task 1 are unnecessarily complex (over-engineered). Lots of simplification is needed in the pairing/partnering code.
- Not in a namespace like the newer components; these functions should be placed in a `distribution::` namespace.
- MPI performance has not been an emphasis for us, as we have focused on adding features and node-local performance. Performance evaluation is needed.
Each of the “elements” represented/managed by this component has three primary properties:
- an arbitrary but consistent index in the list of active elements. This determines where each element `i` reads from the system state vector `x` (`x_i`) and writes to the next state `y` (`y_i`).
- coordinates that determine where information from operators (coefficient matrices) should be retrieved to apply to `x_i` to yield `y_i`. Each element has dimension-many {level, cell} coordinates.
- a unique integral id used to identify elements.
We assume full element connectivity, so for each pair of elements, a connection exists that represents some piece of the work that must be done to form the system matrix (implicit time advance) or apply the system matrix to the state vector (`Ax`, explicit time advance) to advance time. The `elements` component contains helper functions for working with elements, as well as the `elements::table` object that manages the list of active elements, their ordering, and their coordinates.
In short, each element connection is a unit of work that must be done to advance time. The part of the system matrix the element connection requires is row-indexed by the element’s ordering and column-indexed by the connected element’s ordering. Consequently, `x_i` is indexed by the connected element’s ordering, and `y_i` is indexed by the element’s ordering. Elements contain coordinates that locate, within the single-dimensional coefficient matrices, the submatrices required to advance from `x_i` to `y_i` when the system matrix is not explicitly formed. See the `kronmult` component.
- class `elements::table` (additional documentation in code): https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/elements.hpp#L56
Given a PDE and the number of hierarchical basis levels/basis degree, the `elements::table` constructor applies a sparsity rule to build a list of active elements. This class also includes functions to enable/disable elements when the element grid is adapted (see the adaptivity component). A `get_child_elements` function is provided to assist with refinement.
Element coordinates can be retrieved with the `get_coords` member function, but a flattened version of the element table (with all active coordinates stored contiguously in memory, ordered as in the table) is also prepared to assist with on-device kronmult.
- functions `map_to_id` and `map_to_coords`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/elements.hpp#L23
Translate between an element’s unique integral id and its coordinates.
- functions `get_level_cell` and `get_1d_index`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/elements.hpp#L16
Translate between a {level, cell} coordinate and a unique id for that coordinate.
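For orientation, the usual 1D mapping for a hierarchical basis gives level 0 a single cell and level `l >= 1` exactly `2^(l-1)` cells, so a {level, cell} pair maps to a unique index as sketched below. This is a hedged illustration of that convention; consult `get_1d_index` in `src/elements.hpp` for ASGarD's actual rule:

```cpp
#include <cassert>
#include <cstdint>

// Level 0 holds one cell (index 0); level l >= 1 holds 2^(l-1) cells,
// placed after all cells of lower levels.
int64_t index_1d(int level, int cell)
{
  assert(level >= 0 && cell >= 0);
  if (level == 0)
    return 0;
  return (int64_t{1} << (level - 1)) + cell;
}
```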
This is one of the clearer components. One potential improvement is in the relationship between the `elements::table` and `adapt::distributed_grid`. The `distributed_grid` contains an element table member. This makes sense, because the `distributed_grid` must add/remove elements from the table during adaptivity. However, the table can be retrieved with a simple getter; it would be better to provide a more limited interface.
This simple component primarily exists as a layer over `lib_dispatch` within the `fm` namespace. The main functions in this component accept vector/matrix class arguments (see the `tensors` component) and invoke `lib_dispatch` to perform BLAS operations. The tensor classes also have member functions that invoke BLAS (e.g., the `fk::matrix * fk::vector` operator invokes `gemv`), but the `fast_math` API allows the user additional flexibility. For example, the `fm::gemm` function allows the user to supply a view or owning tensor argument `C` (the result), whereas the member functions will always allocate a new object for `C` (a constraint of returning by value).
Some utility math functions may also be placed here when they do not correspond directly to MATLAB functionality; otherwise, those functions are placed in `matlab_utilities`.
- BLAS-like functions (`scal`/`gemv`/`gemm`/`nrm2`/`copy`/`axpy`/etc.): https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/fast_math.hpp#L156
Perform BLAS operations on tensors.
Could be extended as needed with more functionality – not all BLAS routines or common BLAS extensions are available here.
This very small component wraps the HighFive header-only HDF5 library. Its only current use is writing the initial condition/solution vector to disk. To enable IO, users must 1) build ASGarD with `ASGARD_IO_HIGHFIVE` defined, and 2) select a wavelet and/or realspace write frequency at runtime (see `./asgard --help`).
- `initialize_output_file` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/io.hpp#L13
Initializes an HDF5 dataspace and writes the initial condition to disk.
- `update_output_file` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/io.hpp#L43
Writes the solution vector to disk.
Implemented before component namespaces – these functions should exist within an `io` namespace.
Physicists may require additional output information beyond the solution vector, e.g. the adapted grid for a given timestep. This component could be extended to write more information.
Additionally, the component is not designed to work in the distributed case; every local rank will write over its fellow local ranks’ filenames. There is also no provision for writing the final solution gathered from all ranks.
This component provides an interface to `kronmult_cuda`, which contains the functions that handle the core computational kernel in explicit timestepping (and, in future, iterative implicit timestepping) – conceptually, `y = Ax`, where `A` is the system matrix representing the PDE operators, `x` is the current state vector, and `y` is the next state vector. For these routines, the tensor-encoded `A` is not explicitly formed. Instead, for each element connection, element coordinates are used to retrieve small, dense submatrices {`A_0`, ..., `A_ndims-1`} from each single-dimensional operator matrix (tensor “factor” of `A`). For each element connection (and each term) we compute a “kronmult” (Kronecker product * vector): `y_i = (A_0 kron A_1 kron ... kron A_ndims-1) * x_i`. We do not explicitly form the Kronecker product used to compute `y_i`; the arithmetic is reformulated as a series of small matrix-matrix multiplications.
Kronmults are not actually performed by this component – `kronmult_cuda` handles that – but this component provides the necessary setup routines for problem decomposition and memory allocation.
- function `execute` (public): https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.hpp#L19
Given all the objects necessary to represent the problem (`PDE<P>` with coefficient matrices, `elements::table`, state vector `x`, the `subgrid` assigned to the rank) and workspace constraints, decomposes the problem such that it fits within the workspace, performs `Ax`, and returns `y`. Invokes functions to 1) allocate memory (or reallocate if the grid is adapted), 2) decompose the problem into smaller parts that fit in memory, 3) build pointer lists for kronmults, and 4) execute kronmults to perform `Ax` using `kronmult_cuda` component functions. Modifications to the underlying “kronmult” library (https://github.com/project-asgard/kronmult) not yet included in ASGarD move the allocation and decomposition tasks into the kronmult library. Using this newer version will require stripping the `kronmult` component of these roles.
- function `decompose`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.cpp#L94
Breaks a rank’s assigned subgrid into a set of smaller ones such that the problem fits in memory. The `execute` function iterates over the returned subgrid set to perform all the necessary work for `Ax`.
- function `execute` (private): https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.cpp#L276
Invokes the `kronmult_workspace<P>::get_workspace` function to allocate/reallocate workspace memory, and then calls into `kronmult_cuda` functions to execute one partial subgrid’s kronmults. Iteratively invoked by the public `execute` function for each partial subgrid.
- class `kronmult_workspace`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.cpp#L145
Allocates/reallocates workspace for kronmult. The workspace is dominated by the space needed for “intermediate products” during each kronmult. Each kronmult requires computing `y_i = (A_0 kron A_1 kron ... kron A_ndims-1) * x_i`, but we perform this math as a chain of small matrix-matrix products. Thus, we need space to store the input from the last multiplication and the output of the current multiplication – 2 workspaces per kronmult.
To compute the workspace size, note the number of entries in `x_i` and `y_i` is `degree^ndims`, where `degree` is the degree of the Legendre basis polynomials (`-d` argument at the command line) and `ndims` is the number of dimensions in the PDE being solved. Each matrix `A_i` in the Kronecker product is square with `degree` rows. So, the size of each intermediate product is also `degree^ndims`. We need two such workspaces (input/output) for each kronmult, so we allocate `num_kronmults * degree^ndims` entries for each, i.e., `2 * num_kronmults * degree^ndims` in total.
Additional space is also required for building the pointer lists that are passed into `kronmult_cuda` to perform the arithmetic. These lists contain pointers to elements’ `A_i` matrices, as well as `x_i` and `y_i` for each element, along with the workspace region for each element.
Adaptivity may increase/decrease the number of elements in the element grid (and thus the number of kronmults we need to compute). This event will trigger reallocation. Reallocation will not occur if the number of elements assigned to a rank stays the same or decreases. Because workspaces are allocated on GPU (slow), we avoid reallocation when possible.
The newest kronmult library handles decomposition and allocation internally; these parts of the `kronmult` component should eventually be stripped out. Performance regression testing should be used to ensure the new allocation scheme is performant relative to the current implementation.
Optimizations could be made to the pointer lists as well. Given that we already have an ordering of elements, we could just use the element number and offsets from `x` and `y` to compute an element’s `x_i`, `y_i`, and workspace position.
This short-sightedly named `kronmult_cuda` component (it should be `kronmult_device`) contains the three functions that prepare for and call into the kronmult library (https://github.com/project-asgard/kronmult) to concurrently perform some number of kronmults (see the `kronmult` component). These functions will run on GPU if `ASGARD_USE_CUDA` is defined at build time, with fallback to OpenMP if it is not. So this component, along with the kronmult library, is the computational core of the code for explicit timestepping, and for iterative implicit stepping (once implemented).
- function `stage_inputs_kronmult`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/device/kronmult_cuda.hpp#L4
Each kronmult uses one input workspace and one output workspace (see the `kronmult` section). This function copies `x_i` into the workspace for each kronmult to provide the initial input for the kronmult’s matrix multiplication chain.
- function `prepare_kronmult`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/device/kronmult_cuda.hpp#L8
Builds pointer lists for each kronmult. These pointer lists include the positions of operator submatrices and `x_i`/`y_i` segments for each kronmult. The kronmult library uses this information to read/store data in the expected place for each kronmult.
- function `call_kronmult`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/device/kronmult_cuda.hpp#L18
Sends the pointer lists built by the prior routine to the kronmult library for computation. If built with CUDA, ASGarD utilizes one thread block for each kronmult.
These kernels were profiled and optimized at the 2020 NERSC Hackathon, but additional optimization is possible.
Additionally, if we coordinated the ordering of elements between the kronmult library and the `kronmult_cuda` component, we could use base pointer + offset techniques to reduce pointer list sizes.
Despite being added after our component namespace policy was implemented, this component is frustratingly not in its own namespace (unambiguously Tyler’s fault).
Finally, we could “hipify” these kernels. Along with a “hipified” kronmult library and some other minor changes (e.g., warp/thread block sizing), we could then run ASGarD (explicit stepping) on AMD GPUs. In this case, we could generalize the component name to `kronmult_device` or similar.
This simple component provides templated precision/device execution wrappers for BLAS functions, which are primarily invoked by the `tensors` and `fast_math` components. Wrapper signatures mimic the BLAS API, with the precision identifier (e.g., s or d) dropped. Currently, we do not support complex types, as there is not yet any complex arithmetic in ASGarD. The component does not provide a complete BLAS; functions are added as needed by 1) adding an `extern "C"`-qualified declaration for the BLAS function to be added and 2) adding a wrapper templated on precision that invokes BLAS/CUBLAS functions.
Supporting non-NVIDIA devices will require extension of this component (as well as `kronmult_cuda` and `tensors`).
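A minimal sketch of the precision-dispatch pattern, assuming only a standard Fortran BLAS (`saxpy_`/`daxpy_`); ASGarD's actual wrappers also route to CUBLAS and cover many more functions:

```cpp
#include <type_traits>

extern "C" void daxpy_(int const *n, double const *alpha, double const *x,
                       int const *incx, double *y, int const *incy);
extern "C" void saxpy_(int const *n, float const *alpha, float const *x,
                       int const *incx, float *y, int const *incy);

// y = alpha*x + y, dispatched to the correct BLAS symbol by precision.
template<typename P>
void axpy(int n, P alpha, P const *x, int incx, P *y, int incy)
{
  static_assert(std::is_same_v<P, double> || std::is_same_v<P, float>,
                "only float/double supported");
  if constexpr (std::is_same_v<P, double>)
    daxpy_(&n, &alpha, x, &incx, y, &incy);
  else
    saxpy_(&n, &alpha, x, &incx, y, &incy);
}
```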
- `gemm`/`gemv`/etc. wrapper functions, e.g.: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/lib_dispatch.hpp#L135
Invoke BLAS/CUBLAS routines with the correct precision.
- struct `device_handler`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/lib_dispatch.cpp#L33
This statically initialized struct prepares the CUBLAS library when ASGarD is built for GPU execution (`ASGARD_USE_CUDA`).
We will likely need additional BLAS functions - these can be added easily by supplying more wrapper functions with tests.
Supporting non-NVIDIA devices will require modifying the `device_handler` and adding additional arguments/paths to the wrappers.
The project's mathematicians and physicists develop the methods/mathematical techniques for ASGarD in a separate MATLAB implementation: https://github.com/project-asgard/DG-SparseGrid. New features are generally added to ASGarD after math/physics professionals first implement a MATLAB "template". Then, test data (inputs/outputs) are generated from the MATLAB implementation and stored in `[ASGarD root]/testing/generated-inputs/[component-name]`. Finally, a C++ version of the feature is implemented based on the MATLAB "template", with modifications made for performance and design consistency, and tests are added to ensure the C++ feature faithfully replicates the functionality of the MATLAB template.
Because of this MATLAB-to-C++ development process, it is helpful to implement some MATLAB built-ins in C++. These functions are placed in the `matlab_utilities` component. Note that given the differences between MATLAB and C++, the C++ functions may not exactly match the MATLAB API. Additionally, testing against MATLAB requires functions to read MATLAB-written vectors/matrices from disk; these functions are also housed in this component.
This component is primarily used in the routines that initially set up the problem, before entering the loop that advances time in the code.
- `read_[scalar/vector/matrix]_from_txt_file` functions: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/matlab_utilities.hpp#L123
These functions are designed to read data written by functions in the MATLAB repository (specifically, the `write_octave_like_output` function) for testing.
- `find` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/matlab_utilities.hpp#L79
Returns the indices in an `fk::vector`/`fk::matrix` for which the supplied predicate returns true.
- `eye` function
Returns an identity matrix with the given size.
- `linspace` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/matlab_utilities.hpp#L56
Returns a vector with N linearly spaced elements between the provided start and end values.
- `polyval` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/matlab_utilities.hpp#L64
Evaluates a polynomial. Specifically, yields `y = p(0)*x^n + p(1)*x^(n-1) + ... + p(n-1)*x + p(n)`, where `p` is a vector of length `n+1` whose elements are the coefficients of the polynomial in descending power order. `x` can also be vector-valued; in this case, a vector `y` is returned with the polynomial evaluated at each element of `x`.
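A standalone C++ analogue of the `polyval` behavior just described, using Horner's method (a hypothetical free function, not ASGarD's exact signature):

```cpp
#include <vector>

// p holds coefficients in descending power order: p[0]*x^n + ... + p[n].
double polyval(std::vector<double> const &p, double x)
{
  double y = 0.0;
  for (double coeff : p)
    y = y * x + coeff; // Horner's method
  return y;
}

// Vector-valued x: evaluate the polynomial at each element.
std::vector<double> polyval(std::vector<double> const &p,
                            std::vector<double> const &xs)
{
  std::vector<double> ys;
  ys.reserve(xs.size());
  for (double x : xs)
    ys.push_back(polyval(p, x));
  return ys;
}
```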
This is an older component implemented before the component-namespace policy. These functions should be placed in one namespace.
The `PDE<P>` class is one of the foundational abstractions in the code, representing the problem we seek to solve with ASGarD. `PDE<P>` is defined in `pde/pde_base.hpp`; this component merely provides a factory routine to instantiate a `PDE<P>` given an enum type argument.
- `make_PDE` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/pde.hpp#L45
Given a `PDE_opts` argument (defined in the `program_options` component), constructs and returns a `PDE<P>` (defined in `pde/pde_base.hpp`). Because each specific PDE solved by ASGarD is modeled by a class derived from the base `PDE<P>` class, the factory returns a `unique_ptr<PDE<P>>` (see https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rc-factory).
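A minimal sketch of the enum-dispatched factory pattern described above. The enum values and bare constructors are illustrative; the real `make_PDE` also forwards user options to the derived PDE's constructor:

```cpp
#include <memory>
#include <stdexcept>

enum class PDE_opts { continuity_1, continuity_2 }; // illustrative subset

template<typename P>
struct PDE { virtual ~PDE() = default; };
template<typename P>
struct PDE_continuity_1 : PDE<P> {};
template<typename P>
struct PDE_continuity_2 : PDE<P> {};

// Factory: one switch case per supported problem.
template<typename P>
std::unique_ptr<PDE<P>> make_PDE(PDE_opts choice)
{
  switch (choice)
  {
  case PDE_opts::continuity_1:
    return std::make_unique<PDE_continuity_1<P>>();
  case PDE_opts::continuity_2:
    return std::make_unique<PDE_continuity_2<P>>();
  default:
    throw std::runtime_error("unrecognized PDE choice");
  }
}
```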
The `PDE<P>` class is overly complex, and adding a specific PDE to the code requires lots of steps: creating a class derived from `PDE<P>`, adding an enum value (`PDE_opts`) in `program_options`, adding a switch case in the factory method in this component, and adding tests in many separate test files (`boundary_conditions`, `pde`, `time_advance`, `coefficients`, ...). After overhauling the way PDEs are specified, this constructor may need modification.
This component is also not isolated within its own namespace, as the newer components are.
The `PDE<P>` class defined in this component represents the interface for the problem ASGarD will solve. Derived classes (also found in the `pde/` directory) model specific problems.
A PDE with `ndims` dimensions is primarily composed of:
- `ndims`-many `dimension` objects, each of which defines the number of hierarchical basis levels and the domain for one dimension.
- An arbitrary number of functions to define source vectors. These are added to the solution vector when time is advanced.
- One or more `term_set` objects. Each `term_set` must define dimension-many single dimensional `term`s, each with its own coefficient matrix (see `coefficients`, `kronmult`).
- Each `term` has one or more `partial_term` objects. These partial terms each define a coefficient matrix. Each `term`'s coefficient matrix is formed by multiplying together its `partial_term` coefficient matrices.
- `PDE<P>` class.
This class is the interface for problems solved with ASGarD. `dimension`, `term`, and other objects defined in this component are building blocks for `PDE<P>`. The constructor is invoked once during problem setup; its arguments are passed from values statically defined in the specific (derived) PDE in the `pde/` directory requested by the user (`-p` argument). The single dimensional coefficient matrices that together conceptually represent the system matrix `A` are stored in this class, and can be retrieved with the `get_coefficients` function.
Of all the components in ASGarD, this one needs the most work. Ideally, we would like to specify PDEs in a more natural way, as (potentially non-expert) users will need to perform this task to solve custom problems. A DSL for representing PDEs could be one approach; this could be especially useful if DSL files could be read by both MATLAB and C++, since the process of porting PDEs between languages is fraught with opportunities to make mistakes.
The primary challenge in "fixing" the way PDEs are represented in ASGarD is specifying, storing, and passing around functions. Statically defining function sets, as we do now, works. But, this is far from the cleanest/most efficient way to add problems to the code.
This component provides the level component of the level and cell coordinates for all elements that will be included in the initial element grid ("active" elements). Each element in the grid must have `ndims*2` coordinates - a level and a cell coordinate for each dimension (see the `elements` component). The `elements::table` constructor uses `permutations` functions to generate a list of level coordinate sets to be included in the grid. Each set of coordinates in the list will add multiple elements to the grid - the `elements` component determines all valid cell coordinates (see `elements::table::get_cell_index_set`) for each level coordinate set.
- `get_max_multi` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/permutations.hpp#L46
Builds all valid level coordinate sets (full grid).
- `get_lequal_multi` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/permutations.hpp#L39
Uses a sparsity rule to filter valid level coordinates (sparse grid). The currently implemented rule excludes level coordinate sets where the sum of level coordinates across dimensions exceeds the maximum initial level (`-l` argument to ASGarD).
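For illustration, the "sum of levels at most `max_level`" rule can be enumerated recursively, as sketched below with hypothetical names (see `get_lequal_multi` for the real implementation):

```cpp
#include <vector>

// Append to out every ndims-long level tuple whose components sum to
// at most max_level.
void enumerate(int ndims, int max_level, std::vector<int> &current,
               std::vector<std::vector<int>> &out)
{
  if (static_cast<int>(current.size()) == ndims)
  {
    out.push_back(current);
    return;
  }
  int used = 0;
  for (int l : current)
    used += l;
  for (int l = 0; l + used <= max_level; ++l)
  {
    current.push_back(l);
    enumerate(ndims, max_level, current, out);
    current.pop_back();
  }
}

// All level coordinate sets for an ndims-dimensional sparse grid.
std::vector<std::vector<int>> get_level_sets(int ndims, int max_level)
{
  std::vector<std::vector<int>> out;
  std::vector<int> current;
  enumerate(ndims, max_level, current, out);
  return out;
}
```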
This component serves its limited role well. It is an almost direct port of code in the MATLAB implementation.
This component exists to parse command line input, and to store runtime options that are not PDE-related (those options are stored in the `PDE<P>` class). A list of command line options can be printed to the console using `./asgard --help`. The PDEs (specific problems) that ASGarD can solve can be listed using `./asgard --available-pdes`.
- `parser` class
This class accepts command line input via its constructor, which parses arguments using the Clara header-only library. Getter functions are provided for retrieving user values. The `is_valid` function returns true when user-provided values fall within acceptable ranges.
- `options` class: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/program_options.hpp#L281
The `parser` is used to construct both the `PDE<P>` object that represents the problem being solved and the `options` object that stores other user arguments. The `options` object contains 1) public `const` member variables where user-provided arguments are stored and 2) simple helpers that return whether or not ASGarD should write wavelet or realspace output at a given timestep.
The `parser` object communicates when user-provided values fall outside acceptable ranges (e.g., a negative number of hierarchical basis levels) by returning false from the `is_valid` function. There may be a better way to communicate when user-provided values are invalid, but any solution should facilitate unit testing (e.g., it cannot simply call `exit`). The unit testing framework used by ASGarD (Catch) may provide functions that expect exceptions to be thrown for testing, but we have generally avoided exceptions.
The Clara library is no longer maintained. There may be a maintained fork; we have not searched for one yet.
The `coefficients` component requires evaluating Legendre polynomials (or their derivatives) at certain values (math folks, please add details!).
- `legendre` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/quadrature.hpp#L15
Evaluates `degree`-many Legendre functions and their derivatives for the input domain. The function returns a `std::array` of `fk::matrices`. Element zero has a row for each value in the input domain; each of the `degree`-many columns represents a Legendre polynomial evaluated at that value. Element one is structured similarly, except that the first derivative of each polynomial is evaluated. See the recurrence sketch after this list.
- `legendre_weights` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/quadrature.hpp#L26
(Taken from MATLAB. I have no idea. Math folks, help!) Computes the Legendre-Gauss nodes (roots on x) and weights on an interval [`lower_bound`, `upper_bound`] for a Legendre polynomial of degree `num_points`. These are later used for computing definite integrals using Legendre-Gauss quadrature.
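For the `legendre` evaluation above, the values and first derivatives at a point follow from Bonnet's recurrence. A hedged, standalone sketch (ASGarD's function additionally normalizes for the basis and fills `fk::matrix` outputs over a whole input domain):

```cpp
#include <vector>

// Evaluate P_0..P_{degree-1} and their first derivatives at x using
//   (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x)
//   P'_{n+1}(x)      = (n+1) P_n(x) + x P'_n(x)
void legendre_at(double x, int degree, std::vector<double> &p,
                 std::vector<double> &dp)
{
  p.assign(degree, 0.0);
  dp.assign(degree, 0.0);
  if (degree > 0) { p[0] = 1.0; dp[0] = 0.0; }
  if (degree > 1) { p[1] = x;   dp[1] = 1.0; }
  for (int n = 1; n + 1 < degree; ++n)
  {
    p[n + 1]  = ((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1);
    dp[n + 1] = (n + 1) * p[n] + x * dp[n];
  }
}
```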
The standard library may contain overlapping functionality; we may be able to substitute some of these functions to reduce our maintenance/testing burden (https://en.cppreference.com/w/cpp/numeric/special_functions/legendre).
These functions should be placed within a `quadrature` namespace.
Implicit timestepping methods in ASGarD are driven by a direct dense solve `y = (I - dt*A) \ x`, where `A` is the system matrix and `x` is the current state vector (see `time_advance`). Currently, we explicitly form `A` for this purpose when implicit timestepping is chosen by the user (`-i` argument to ASGarD). However, we have not yet implemented a distributed direct solve, so implicit timestepping is limited to a single node. Also, the math professionals on the project are exploring other techniques for this solve requirement, including iterative methods like GMRES. The `solver` component represents our (limited) exploration of iterative methods in the C++ code - a single, simple GMRES function that locally performs GMRES for an explicitly formed `A`. However, since `A` is very large and we already have a technique for distributed `A*x` without forming `A` in the `kronmult` code, we could eventually implement distributed, matrix-free GMRES atop `kronmult`.
The solve technique used in implicit stepping can be selected with the `-s` argument to ASGarD. See `./asgard --help`.
- function `simple_gmres`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/solver.hpp#L8
Given a system matrix `A`, right-hand side `b`, an initial guess for `x`, and some tolerance, approximates the solution to `x = A\b` with residual less than `abs(tolerance)`. A preconditioner `M` can optionally be provided.
This is a simple first step toward iterative methods in ASGarD. As noted in the overview, we could eventually implement a distributed GMRES driver here that calls into the `kronmult` code for `A*x`.
The `tensors` component houses the `fk::matrix` and `fk::vector` classes used throughout the code, with support for linear algebra operations through both member functions/operators and the `fast_math` component.
Both `fk::vector` and `fk::matrix` can represent owning references or non-owning views, as well as CPU- or GPU-resident data (see View Semantics: https://github.com/project-asgard/asgard/wiki/view-semantics). In addition to a template parameter describing a tensor's floating point precision, non-type template parameters and SFINAE are used to restrict the API and signal to other components when operations should take place on CPU or GPU. A tensor's "memory type" template parameter indicates whether it is owning or non-owning, and its "resource type" parameter indicates whether it is backed by CPU or GPU memory. For instance, the `const_view` memory type indicates a non-owning, read-only view of an underlying owning tensor, and the `resize` member function is disabled by SFINAE for this type. Many member functions that require direct random access are disabled for `device` resource type tensors (those on GPU), but not `host` tensors.
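A minimal sketch of that non-type-template-parameter + SFINAE pattern, with hypothetical types rather than ASGarD's actual declarations:

```cpp
#include <type_traits>
#include <vector>

enum class mem_type { owner, const_view };
enum class resource { host, device };

template<typename P, mem_type mem = mem_type::owner,
         resource res = resource::host>
class vector
{
public:
  // resize participates in overload resolution only for host-side owners;
  // calling it on a const_view or device vector fails to compile.
  template<mem_type m = mem, resource r = res,
           typename = std::enable_if_t<m == mem_type::owner &&
                                       r == resource::host>>
  void resize(int new_size)
  {
    data_.resize(new_size);
  }

private:
  std::vector<P> data_;
};
```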
- `vector` class: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tensors.hpp#L97
This extensive class represents a linear algebra vector. No distinction is made between row and column vectors. Because vectors represent a region of contiguous memory, both `vector` and `matrix` views can be constructed from `vector` owners; so, `vector`s are a means of pre-allocating some amount of flexible linear algebra memory. Member functions include various constructors for direct/copy/move construction of owners and views on CPU or GPU, as well as precision conversion/truncation "copy constructors" that translate between vectors of differing floating point precision. `transfer_from` and `clone_onto_[host/device]` functions provide a means of transferring vectors to/from GPU. Math operators, including `+` and `*`, are defined to simplify arithmetic operations; note that `fast_math` provides a more flexible API for these operations. Finally, utility functions, including `print`, are useful for debugging.
- `matrix` class: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tensors.hpp#L349
Like `vector`, the `matrix` class contains constructors/destructor, precision conversion routines, arithmetic operations, and utility routines (including `print`).
- functions for handling device memory: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tensors.hpp#L640
These functions use conditional compilation to allocate and manage `device` tensors on GPU when ASGarD is built for GPU execution (`ASGARD_USE_CUDA`), with fallback to CPU otherwise.
The tensors component uses the `fk` ("fully kinetic") namespace primarily to disambiguate our vector class from `std::vector`. Originally, "fully kinetic" was part of the code's working name. Because of the ubiquity of linear algebra in the code, replacing `fk` with the component name `tensors` might be unnecessarily verbose, but a more descriptive namespace might be more appropriate.
The complexity in this component makes modification (or even understanding its API) difficult, and the unit tests are extensive. However, simplifying tensors may require splitting their responsibilities between separate objects; `tensors` currently includes both owning and non-owning references to matrices and vectors located on CPU or GPU, with SFINAE used to restrict the API as appropriate. Using concepts rather than SFINAE for this purpose could also enhance readability.
Given the current problem state, this component solves for the `t+1` state vector. Explicit time advance is performed using a third-order Runge-Kutta method (`explicit_advance`). The backward Euler technique is used for implicit stepping (`implicit_advance`). Other time stepping methods could be used, and many are implemented in the MATLAB version of the code.
Because problems must be completely set up before `time_advance` functions run, and these functions generate essentially the final answer (in wavelet space) for the code, `time_advance` unit tests function similarly to integration tests, in that they exercise nearly all of the code. Ensuring ASGarD `time_advance` output matches MATLAB output in tests increases our confidence in the end-to-end correctness of ASGarD.
- `adaptive_advance` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/time_advance.hpp#L20
This is the driver routine for advancing time in the code. `explicit_` or `implicit_advance` are invoked from this function. ASGarD is capable of adapting the grid of active elements (see the `elements` and `adapt` components). When adaptivity is enabled (`--do_adapt`), this function first coarsens the grid (removes elements with low-magnitude coefficients), then takes probing "fake" timesteps to determine where refinement (adding child elements for high-magnitude element coefficients) is needed.
- `explicit_advance` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/time_advance.hpp#L32
Advances time with a third-order Runge-Kutta method driven by `A*x`, where `A` is the tensor-encoded system matrix and `x` is the current state vector (see `kronmult`).
- `implicit_advance` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/time_advance.hpp#L42
Advances time with the backward Euler technique driven by `(I-dt*A)\x`. Currently, ASGarD supports a local direct dense solve and a local GMRES solve.
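For reference, the backward Euler update treats the right-hand side implicitly, which is where the `(I-dt*A)\x` solve above comes from:

$$
x_{n+1} = x_n + \Delta t \, A \, x_{n+1}
\quad\Longrightarrow\quad
(I - \Delta t \, A)\, x_{n+1} = x_n .
$$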
Significant time investment has been made to optimize explicit time advance, which can be run in the distributed setting as well as locally.
Much less work has been done for implicit time advance. We would like to explore additional approaches for distributing the implicit advance (see the `solver` component).
This component is meant to house auxiliary development code. Currently, this includes a simple timer for measuring performance and a layer over the `assert` macro to avoid unused variable warnings when asserts are disabled (i.e., `CMAKE_BUILD_TYPE=Release`).
- `simple_timer` class: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tools.hpp#L23
The `start` and `stop` routines can wrap execution of a certain function invocation/code block with timer start/stop to measure performance. An identifier string argument is used to pair `start`s with `stop`s. The `report` function then generates a table that shows, for each identifier, avg/med/high/low timings. `report` is currently invoked at the end of `main`.
- `expect` macro: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tools.hpp#L16
Executes a no-op (void cast) on the provided value when asserts are disabled, to prevent unused variable warnings (variables are sometimes set up only for asserts). Translates to `assert` otherwise.
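A sketch of the pattern just described, with a hypothetical macro body (see `src/tools.hpp` for ASGarD's actual definition):

```cpp
#include <cassert>

#ifdef NDEBUG
// asserts disabled: void-cast the condition to silence unused-variable
// warnings without performing any runtime check
#define expect(condition) ((void)(condition))
#else
#define expect(condition) assert(condition)
#endif
```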
ASGarD has built-in profiling options (see https://github.com/project-asgard/asgard/wiki/profiling). The `simple_timer` complements these options with a low cost, high-level summary of code performance. Built-in profiling and the `simple_timer` report could be used as building blocks in automated performance regression testing.
ASGarD solves PDEs in wavelet space. However, PDEs are specified in real space, and the solution produced by ASGarD must eventually be converted to real space for analysis. The `transformations` component (along with the `basis` component) provides routines to convert between wavelet and real space. (Math folks - more info in this overview and in the function descriptions, please!)
The transform-to-realspace functionality provided by this component (`gen_realspace_transform`/`wavelet_to_realspace`) consists of initial implementations that only work locally (i.e., not when multiple ranks are used) and may not be scalable enough for large problems.
- `gen_realspace_transform` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/transformations.hpp#L23
Generates the matrices needed for wavelet to real space conversion.
- `wavelet_to_realspace` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/transformations.hpp#L28
Given a wavelet space vector and the matrices generated by `gen_realspace_transform`, converts the wavelet space vector to real space and stores it in the in/out real space vector argument.
- `transform_and_combine_dimensions` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/transformations.hpp#L142
Combines single dimensional real space function arguments (`v_functions`) using Kronecker products and transforms the resulting multi-dimensional vector to wavelet space. (Math folks - is this right?) Requires a `wavelet_transform` object from the `basis` component.
This component should be placed in its own namespace.
The real space transform process should be re-designed (strange interface and awkward implementation). Additionally, the current algorithm does not scale well - the math/physics professionals are investigating different approaches.