Component Overview
These notes are meant to provide a brief summary of ASGarD’s major code components. The goal in writing these is to document the code, but also to provide a place where math/physics professionals can add their insight to improve the team’s overall understanding of how the code works. So, this is not a closed writeup – please use your expertise to add detail and fix misconceptions.
“Adaptivity” describes ASGarD’s capability to add/remove element grid points as needed to more efficiently and accurately model a solution. Elements responsible for relatively large coefficients in the solution may be “refined” (have child elements added to the grid). Elements whose coefficients are very small may be “coarsened” (removed from the grid). The thresholds for coarsening and refinement are controlled by an input argument to ASGarD (`--thresh`).
- class `distributed_grid`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/adapt.hpp#L37
The `distributed_grid` is composed of the element table (from the `elements` component) and the `distribution_plan` (`distribution` component) that maps each MPI rank in the simulation to the elements for which the rank must compute coefficients. In short, this class bundles information about the elements and how they are distributed, and provides functions to adapt the grid given the solution vector holding the elements’ coefficient values.
The main routines driving this functionality (`distributed_grid::refine` and `distributed_grid::coarsen`) refine/coarsen the element grid by adding/deleting elements from its member `elements::table`, and then redistributing responsibility for the new element grid across ranks using functions in the `distribution` component.
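To make the `--thresh` mechanics concrete, here is a minimal sketch of relative-threshold element selection. All names are hypothetical and the selection rule is simplified; the real logic lives in `distributed_grid::refine`/`coarsen` in `src/adapt.hpp`.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Select element indices whose largest coefficient magnitude meets or
// exceeds thresh * max|x|; these are candidates for refinement. (A
// coarsening pass would instead collect elements below a lower threshold.)
std::vector<int> elements_to_refine(std::vector<double> const &x,
                                    int coeffs_per_element, double thresh)
{
  double max_abs = 0.0;
  for (double v : x)
    max_abs = std::max(max_abs, std::abs(v));

  std::vector<int> refine_ids;
  int const num_elements = static_cast<int>(x.size()) / coeffs_per_element;
  for (int elem = 0; elem < num_elements; ++elem)
  {
    double elem_max = 0.0;
    for (int i = 0; i < coeffs_per_element; ++i)
      elem_max =
          std::max(elem_max, std::abs(x[elem * coeffs_per_element + i]));
    if (elem_max >= thresh * max_abs) // relative threshold, per --thresh
      refine_ids.push_back(elem);
  }
  return refine_ids;
}
```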
- It is reasonable to encapsulate the element table with the distribution plan, since the adaptivity functions require altering the grid and rebalancing the problem across ranks, but providing `const &` getters for these members is not ideal. It would be better to provide a stricter interface allowing outside components to access the functionality they need without exposing (for read) the objects themselves.
- Currently, there is no limit (beyond a level cap) on refinement/coarsening. This introduces two related problems: refinement to the extent that some rank’s portion of the solution vector spills out of RAM (virtually unlimited growth), and coarsening to the extent that fewer elements remain than ranks. We will need to gracefully handle both situations.
Most computations within ASGarD are performed in wavelet space, but the PDEs we want to solve are defined in real space. So, we require a means of transforming vectors/matrices into wavelet space. Most of this functionality is provided in the `transformations` component, but the `basis` component handles the operator required for transformations. Originally, this operator was explicitly formed (as a matrix), with application via matrix multiplication as appropriate to transform values. However, because the transform matrix is sparse, with large zero regions and a regular pattern of dense blocks, we now form and store only the dense blocks. Functions are provided in the `basis` component to apply the transform matrix using these blocks, skipping the unnecessary arithmetic the large zero regions would introduce.
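The block-wise application amounts to multiplying each stored dense block against the corresponding segment of the input vector. A minimal sketch with hypothetical types (the real code applies the blocks held by `wavelet_transform` level by level):

```cpp
#include <vector>

struct dense_block
{
  int row_start, col_start, rows, cols; // block position in the full matrix
  std::vector<double> data;             // row-major block entries
};

// y += T * x, where T is represented only by its dense blocks.
void apply_blocks(std::vector<dense_block> const &blocks,
                  std::vector<double> const &x, std::vector<double> &y)
{
  for (auto const &b : blocks)
    for (int i = 0; i < b.rows; ++i)
      for (int j = 0; j < b.cols; ++j)
        y[b.row_start + i] += b.data[i * b.cols + j] * x[b.col_start + j];
}
```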
- function `operator_two_scale`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/basis.hpp#L15
Deprecated. Explicitly forms the transform matrix. Uses `generate_multi_wavelets`, which is conceptually private but exposed in the interface for testing purposes.
- class `wavelet_transform`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/basis.hpp#L32
The constructor forms the dense blocks of the transform matrix. `apply` methods provide the result of matrix multiplication using these dense blocks.
The dense blocks required for the transform matrix are relatively small and their formation is not costly. So, rather than constructing one instance of this class and passing it around for transforms as needed, we could create instances as needed at function scope to declutter APIs requiring `wavelet_transform` parameters.
Both the deprecated `operator_two_scale` and `generate_multi_wavelets` functions are lengthy and obscure in places. The C++ developers are not necessarily familiar with the underlying math performed by these functions; this makes optimization or clarification of the MATLAB code that served as a template for these routines a challenge. There is an opportunity for someone with math insight to clarify the code here.
“Kronmult” (Kronecker product of matrices followed by matrix-vector multiplication, `Ax = b`, where `A` is a tensor product) is one of the core computational kernels in ASGarD. We often avoid explicit formation of the Kronecker product by instead multiplying each Kronecker operand (in turn) directly with the vector `x`, conceptually reshaped as a small matrix. High dimensional/high resolution problems may require O(10^6) or more small kronmults to move the simulation forward at each timestep.
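The reshaping trick is standard: repeatedly multiply one Kronecker factor against the vector viewed as a small matrix, permuting between passes. A self-contained sketch for square `n x n` factors, with illustrative names rather than ASGarD's actual API:

```cpp
#include <cassert>
#include <vector>

using matrix = std::vector<double>; // n*n entries, column-major

// y = (A[0] kron A[1] kron ... kron A[d-1]) * x, without forming the
// Kronecker product. Each pass multiplies one factor into x (reshaped
// n x N/n) and cyclically permutes the result via a transposed store.
std::vector<double> kronmult(std::vector<matrix> const &A, int n,
                             std::vector<double> x)
{
  int const d = static_cast<int>(A.size());
  int N = 1;
  for (int i = 0; i < d; ++i)
    N *= n;
  assert(static_cast<int>(x.size()) == N);

  std::vector<double> work(N);
  for (int i = d - 1; i >= 0; --i) // last factor first
  {
    int const cols = N / n;
    // W = A[i] * reshape(x, n, cols); then x = vec(transpose(W))
    for (int c = 0; c < cols; ++c)
      for (int r = 0; r < n; ++r)
      {
        double sum = 0.0;
        for (int k = 0; k < n; ++k)
          sum += A[i][k * n + r] * x[c * n + k]; // A[i](r,k) * X(k,c)
        work[r * cols + c] = sum;                // transposed store
      }
    x.swap(work);
  }
  return x;
}
```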
A prior implementation of the time stepping code performed these many kronmults by constructing dimension-many lists (batches) of matrix-multiplication operands and invoking vendor-supplied `batched_gemm` routines. The `batch` component was designed to support constructing, storing, and passing pointer lists for `batched_gemm`.
The `batch` component is no longer used to drive explicit timestepping in `time_advance`; the `kronmult` component now serves this purpose. However, some functions within the component are used for implicit timestepping, and the “kronmult as batched gemm” technique is employed for wavelet to realspace transforms in the `transformations` component.
- class `batch`
Provides storage for a fixed-size list of pointers to same-sized matrices for `batched_gemm`. The `assign_entry` interface stores a pointer to an `fk::matrix` with asserts/size checking; `assign_raw` stores a pointer without such checks. `assign_raw` was added after performance measurements showed that creating/destroying many matrix views can become costly (mostly due to reference counting in the `fk::matrix` class).
- class `batch_chain`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/batch.hpp#L118
Given the operands for a kronmult (a list of matrices and a vector), as well as some externally allocated temporary workspace, the `batch_chain` class internally constructs `batch` members for `batched_gemm`. The `execute` member actually performs the multiplications by calling into the `lib_dispatch` component. Currently, this class is only used for the real space transform.
- functions `batched_gemm`/`batched_gemv`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/batch.hpp#L71
Thin layer over `lib_dispatch` functions with the same name; `lib_dispatch` accepts BLAS-style arguments, so these functions translate/set up for those calls.
- function `build_system_matrix`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/batch.hpp#L147
The implicit time advance is driven by solving against the system matrix. Given the `elements::table` describing which elements are active, as well as the 1D coefficient matrices from the PDE class, this function explicitly constructs the system matrix.
This entire component is likely to be removed in the future. `build_system_matrix` belongs in the `time_advance` component or elsewhere; the `batch_chain`/`batch` classes are only used for the wavelet to real space transforms in `transformations`. The current implementation for this transform is a proof of concept that does not scale well and is not distributed across nodes.
Some problems require the application (by addition) of boundary condition vectors to the solution vector at every timestep. This component a) generates unscaled boundary condition vectors and b) provides functions that scale the boundary condition vectors for the current time. The mathematical/physical significance of these vectors has not generally been understood on the C++ development side, but the above description demonstrates the most salient features of these vectors for C++ development: boundary condition vectors are tied to the current problem and grid (and so must be regenerated after adapting the grid), and must be scaled for time before addition to the solution vector. These features are shared with source vectors, although the implementations are different.
- function `make_unscaled_bc_parts`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/boundary_conditions.hpp#L15
Given a PDE and grid, returns a pair of “unscaled_bc_parts” representing the left and right unscaled boundary condition vectors. The start and stop element integer parameters restrict generation to part of the vectors; they are used to generate only the portion of the vector assigned to a rank in the distribution plan.
- function `generate_scaled_bc`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/boundary_conditions.hpp#L21
Accepts the unscaled boundary conditions and current time as input, and scales the vectors appropriately.
This component has a strange interface – an unscaled boundary condition vector is represented by a vector of vectors of vectors of `fk::vector`s. This three-level nesting is the result of each partial term in each term in each dimension contributing to the solution. However, we already have that exact structure in the PDE class – the PDE contains dimensions and terms that contain partial terms. Rather than existing in an isolated component, each partial term should be responsible for producing its own unscaled part, and the PDE should be responsible for providing the boundary condition vector. Housing source/boundary condition vector storage and generation in the PDE class makes sense logically, would declutter the API, and would provide an opportunity to refactor these functions (which could stand a rewrite). The creation and application of source vectors is somewhat similar, and should probably also be consolidated in the PDE class.
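The nesting described above can be summarized with a type alias. This is an illustrative stand-in (the real declaration uses `fk::vector` from the `tensors` component):

```cpp
#include <vector>

// Stand-in for fk::vector<P> (see the tensors component).
template<typename P>
using fk_vector = std::vector<P>;

// Three-level nesting, indexed as parts[dimension][term][partial_term].
template<typename P>
using unscaled_bc_parts =
    std::vector<std::vector<std::vector<fk_vector<P>>>>;
```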
Each PDE defines dimension-many coefficient matrices (sometimes called “operator matrices”) for each partial term. These partial term matrices are chained together by matrix multiplication to form a single coefficient matrix for each term and dimension. These matrices are square, dense, and `2^level * degree` in size.
Information from `elements::table` is used to draw values from these matrices for either 1) building the system matrix for implicit time advance, or 2) performing system matrix/state vector multiplication for explicit time advance (or, in future, implicit time advance driven by iterative solve). Specifically, each element in the table contains a series of coordinates that designate, for each dimension, where that element should index into the coefficient matrices.
Currently, all coefficients are stored from construction onward on the GPU (if ASGarD is built to use GPUs). Only time-independent coefficients are supported at present, but we expect to support time-dependent coefficients as well.
- function `generate_all_coefficients`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/coefficients.hpp#L7
Takes an input/output `PDE<P>` argument and fills its coefficient matrices by generating all partial term coefficient matrices for each term and dimension and chaining them together by multiplication.
- function `generate_coefficients`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/coefficients.hpp#L12
Generates one partial term coefficient matrix. Invoked from `generate_all_coefficients`.
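The chaining step is plain repeated matrix multiplication. A minimal sketch with hypothetical types (the real code works on `fk::matrix` objects and uses BLAS via `fast_math`):

```cpp
#include <cstddef>
#include <vector>

struct square_matrix
{
  int n;                    // 2^level * degree rows/columns
  std::vector<double> data; // row-major entries
};

square_matrix multiply(square_matrix const &a, square_matrix const &b)
{
  square_matrix c{a.n, std::vector<double>(a.n * a.n, 0.0)};
  for (int i = 0; i < a.n; ++i)
    for (int k = 0; k < a.n; ++k)
      for (int j = 0; j < a.n; ++j)
        c.data[i * a.n + j] += a.data[i * a.n + k] * b.data[k * a.n + j];
  return c;
}

// term coefficient = p_0 * p_1 * ... * p_{m-1} over the partial terms
square_matrix chain(std::vector<square_matrix> const &partial_terms)
{
  square_matrix result = partial_terms.front();
  for (std::size_t i = 1; i < partial_terms.size(); ++i)
    result = multiply(result, partial_terms[i]);
  return result;
}
```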
A PDE’s coefficients (at least, the time-independent ones) should be constructed when the PDE is built (RAII). There is no great reason this is not already done. Aside from that, coefficient matrix generation and update functions could easily (and probably should) be member functions of the PDE class. This rework could take place when the planned overhaul of the `PDE` class occurs.
Also, we currently only support time-independent coefficients. We should add regeneration of time-dependent coefficients in the time advance loop.
The distribution component handles communication between MPI ranks when ASGarD is run with multiple ranks (nodes/GPUs). ASGarD expects one rank per GPU when built to use GPUs. To simplify problem decomposition, we also currently require an even or perfect square number of ranks. If an odd/nonsquare number of ranks is used, ASGarD will detect this condition and “round down” to the nearest valid number of ranks. Note ASGarD’s distribution only handles the explicit time advance case for now; running implicit time advance with multiple ranks will produce an error message and early exit.
ASGarD decomposes the problem across ranks by decomposing the element grid (see the `elements` section), since each element connection contains the information needed to produce a partial solution. We divide the grid into rectangular subgrids as “squarely” as possible given the number of ranks ASGarD is run with (see the `get_num_subgrid_cols` function). We tile these subgrids in row-major order over the element grid, assigning the element connections within each tile to a single rank. The last subgrid in each tile row may be smaller than the others, since the tile size may not evenly divide the number of elements.
The current implementation will produce tiles containing at least as many tile rows as tile columns (i.e., there will be at least as many vertical lines as there are horizontal lines dividing the element grid). All subgrids within a tile row will write (additively) to coefficients in the same region of `y`. So, each tile row must update `y` atomically (current approach) or write separately and reduce into `y` (previous approach). All subgrids within a tile column will read from the same segment of `x`.
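For intuition, a near-square factorization of the rank count can be found by taking the largest divisor no greater than the square root. This sketch is illustrative only; the real rule lives in `get_num_subgrid_cols`, and which factor counts rows versus columns follows ASGarD's convention, not this sketch:

```cpp
#include <utility>

// Return a near-square factorization {larger, smaller} of num_ranks,
// where larger * smaller == num_ranks.
std::pair<int, int> tile_shape(int num_ranks)
{
  int small_side = 1;
  for (int c = 1; c * c <= num_ranks; ++c)
    if (num_ranks % c == 0)
      small_side = c; // largest divisor <= sqrt(num_ranks)
  return {num_ranks / small_side, small_side};
}
// e.g., tile_shape(6) == {3, 2}; tile_shape(16) == {4, 4}
```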
Two major interprocess communication tasks are required by ASGarD and provided by this component: 1) decomposing/distributing the solution vector, and 2) distributing changes to `elements::table` after element grid adaptation. The first task is accomplished by a balanced pairing of senders with receivers, where senders write to the region of `y` that receivers need for the next timestep's `x`. Once this pairing is complete, point-to-point communication is used to exchange vector segments. The second task is accomplished by all-to-all communication to form consensus on the new element table, followed by a redistribution of `x` such that all ranks have the information they need for the next timestep. Task 1 code was implemented first, followed by task 2. Efforts were made to consolidate/share code between these two tasks, but some duplication exists as a result of dissatisfaction with the task 1 implementation.
- helper struct `node_out`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L165
Allows for printing a message once per node, so that output isn’t clobbered by every local rank’s output.
- struct `message`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L199
Simple data object representing an MPI point-to-point message that should be sent. Its fields give the direction (send/receive) and the range of elements whose coefficients should be transferred.
- function `get_plan`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L187
Generates a map from each rank number (0, ..., n-1 for n ranks) to a subgrid representing all the element connections assigned to that rank. Invokes `get_subgrid` for each rank.
- function `reduce_results`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L233
Deprecated, since we now use atomic updates into `y`. Previously, each rank within a tile row would use this function to reduce `y` by gathering each row member’s contribution.
- function `exchange_results`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L243
Takes as input `source` (`y`) and `dest` (`x_next`) for each rank, as well as the distribution plan. Partners ranks and transfers data using MPI from senders’ `source` vectors to receivers’ `dest` vectors, ensuring all ranks have the correct `x_next` for the next timestep.
- function `gather_errors`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L250
Gathers RMSE figures from all ranks on a node, for display after each timestep.
- function `gather_results`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L255
Gathers all coefficients into a single vector. Note we use a `std::vector` due to its 64-bit size field; our `fk::vector`s are currently 32-bit indexed.
- function `get_global_max`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L272
The adaptivity threshold is based on the maximum coefficient value; this function finds the maximum across all ranks.
- function `distribute_table_changes`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L276
Shares changes to the element table across ranks.
- function `redistribute_vector`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/distribution.hpp#L296
After consolidating the new element table, redistributes `x` so that all ranks have coefficients (or zeros, in the case of refined elements) for the next timestep. Old plan and new plan sizes should match; assumes elements are only appended to the list of active elements, or deleted with remaining elements shifted left to fill deleted regions.
- The helper functions for task 1 are unnecessarily complex (over-engineered). Lots of simplification is needed in the pairing/partnering code.
- Not in a namespace like the newer components; these functions should be placed in a `distribution::` namespace.
- MPI performance has not been an emphasis for us, as we have focused on adding features and node-local performance. Performance evaluation is needed.
Each of the “elements” represented/managed by this component has three primary properties:
- an arbitrary but consistent index in the list of active elements. This determines where each element `i` reads from the system state vector `x` (`x_i`) and writes to the next state `y` (`y_i`).
- coordinates that determine where information from operators (coefficient matrices) should be retrieved to apply to `x_i` to yield `y_i`. Each element has dimension-many {level, cell} coordinates.
- a unique integral id used to identify elements.
We assume full element connectivity, so for each pair of elements, a connection exists that represents some piece of the work that must be done to form the system matrix (implicit time advance) or apply the system matrix to the state vector (`Ax`, explicit time advance) to advance time. The `elements` component contains helper functions for working with elements, as well as the `elements::table` object that manages the list of active elements, their ordering, and their coordinates.
In short, each element connection is a unit of work that must be done to advance time. The part of the system matrix the element connection requires is row-indexed by the element’s ordering and column-indexed by the connected element’s ordering. Consequently, `x_i` is indexed by the connected element’s ordering, and `y_i` is indexed by the element’s ordering. Elements contain coordinates that locate, within the single-dimensional coefficient matrices, the submatrices required to advance from `x_i` to `y_i` when the system matrix is not explicitly formed. See the `kronmult` component.
- class `elements::table` (additional documentation in code): https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/elements.hpp#L56
Given a PDE and the number of hierarchical basis levels/basis degree, the `elements::table` constructor applies a sparsity rule to build a list of active elements. This class also includes functions to enable/disable elements when the element grid is adapted (see the adaptivity component). A `get_child_elements` function is provided to assist with refinement.
Element coordinates can be retrieved with the `get_coords` member function, but a flattened version of the element table (with all active coordinates stored contiguously in memory, ordered as in the table) is also prepared to assist with on-device kronmult.
- functions `map_to_id` and `map_to_coords`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/elements.hpp#L23
Translate between an element’s unique integral id and its coordinates.
- functions `get_level_cell` and `get_1d_index`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/elements.hpp#L16
Translate between a {level, cell} coordinate and a unique id for that coordinate.
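For orientation, the usual 1D mapping for a hierarchical basis gives level 0 a single cell and level `l >= 1` exactly `2^(l-1)` cells, so a {level, cell} pair maps to a unique index as sketched below. This is a hedged illustration of that convention; consult `get_1d_index` in `src/elements.hpp` for ASGarD's actual rule:

```cpp
#include <cassert>
#include <cstdint>

// Level 0 holds one cell (index 0); level l >= 1 holds 2^(l-1) cells,
// placed after all cells of lower levels.
int64_t index_1d(int level, int cell)
{
  assert(level >= 0 && cell >= 0);
  if (level == 0)
    return 0;
  return (int64_t{1} << (level - 1)) + cell;
}
```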
This is one of the clearer components. One potential improvement is in the relationship between the `elements::table` and `adapt::distributed_grid`. The `distributed_grid` contains an element table member. This makes sense, because the `distributed_grid` must add/remove elements from the table during adaptivity. However, the table can be retrieved with a simple getter; it would be better to provide a more limited interface.
This simple component primarily exists as a layer over `lib_dispatch` within the `fm` namespace. The main functions in this component accept vector/matrix class arguments (see the `tensors` component) and invoke `lib_dispatch` to perform BLAS operations. The tensor classes also have member functions that invoke BLAS (e.g., the `fk::matrix * fk::vector` operator invokes `gemv`), but the `fast_math` API allows the user additional flexibility. For example, the `fm::gemm` function allows the user to supply a view or owning tensor argument `C` (the result), whereas the member functions will always allocate a new object for `C` (a constraint of returning by value).
Some utility math functions may also be placed here when they do not correspond directly to MATLAB functionality; otherwise, those functions are placed in `matlab_utilities`.
- BLAS-like functions (`scal`/`gemv`/`gemm`/`nrm2`/`copy`/`axpy`/etc.): https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/fast_math.hpp#L156
Perform BLAS operations on tensors.
Could be extended as needed with more functionality – not all BLAS routines or common BLAS extensions are available here.
This very small component wraps the HighFive header-only HDF5 library. Its only current use is writing the initial condition/solution vector to disk. To enable IO, users must 1) build ASGarD with `ASGARD_IO_HIGHFIVE` defined, and 2) select a wavelet and/or realspace write frequency at runtime (see `./asgard --help`).
- `initialize_output_file` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/io.hpp#L13
Initializes an HDF5 dataspace and writes the initial condition to disk.
- `update_output_file` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/io.hpp#L43
Writes the solution vector to disk.
Implemented before component namespaces – these functions should exist within an `io` namespace.
Physicists may require additional output information beyond the solution vector, e.g. the adapted grid for a given timestep. This component could be extended to write more information.
Additionally, the component is not designed to work in the distributed case; every local rank will write over its fellow local ranks’ filenames. There is also no provision for writing the final solution gathered from all ranks.
This component provides an interface to `kronmult_cuda`, which contains the functions that handle the core computational kernel in explicit timestepping (and, in future, iterative implicit timestepping) – conceptually, `y = Ax`, where `A` is the system matrix representing the PDE operators, `x` is the current state vector, and `y` is the next state vector. For these routines, the tensor-encoded `A` is not explicitly formed. Instead, for each element connection, element coordinates are used to retrieve small, dense submatrices {`A_0`, ..., `A_ndims-1`} from each single-dimensional operator matrix (tensor “factor” of `A`). For each element connection (and each term) we compute a “kronmult” (Kronecker product * vector): `y_i = (A_0 kron A_1 kron ... kron A_ndims-1) * x_i`. We do not explicitly form the Kronecker product used to compute `y_i`; the arithmetic is reformulated as a series of small matrix-matrix multiplications.
Kronmults are not actually performed by this component – `kronmult_cuda` handles that – but this component provides the necessary setup routines for problem decomposition and memory allocation.
- function `execute` (public): https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.hpp#L19
Given all the objects necessary to represent the problem (`PDE<P>` with coefficient matrices, `elements::table`, state vector `x`, the `subgrid` assigned to the rank) and workspace constraints, decomposes the problem such that it fits within the workspace, performs `Ax`, and returns `y`. Invokes functions to 1) allocate memory (or reallocate if the grid is adapted), 2) decompose the problem into smaller parts that fit in memory, 3) build pointer lists for kronmults, and 4) execute kronmults to perform `Ax` using `kronmult_cuda` component functions. Modifications to the underlying “kronmult” library (https://github.com/project-asgard/kronmult) not yet included in ASGarD move the allocation and decomposition tasks into the kronmult library. Using this newer version will require stripping the `kronmult` component of these roles.
- function `decompose`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.cpp#L94
Breaks a rank’s assigned subgrid into a set of smaller ones such that the problem fits in memory. The `execute` function iterates over the returned subgrid set to perform all the necessary work for `Ax`.
- function `execute` (private): https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.cpp#L276
Invokes the `kronmult_workspace<P>::get_workspace` function to allocate/reallocate workspace memory, and then calls into `kronmult_cuda` functions to execute one partial subgrid’s kronmults. Iteratively invoked by the public `execute` function for each partial subgrid.
- class `kronmult_workspace`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.cpp#L145
Allocates/reallocates workspace for kronmult. The workspace is dominated by the space needed for “intermediate products” during each kronmult. Each kronmult requires computing `y_i = (A_0 kron A_1 kron ... kron A_ndims-1) * x_i`, but we perform this math as a chain of small matrix-matrix products. Thus, we need space to store the input from the last multiplication and the output of the current multiplication – 2 workspaces per kronmult.
To compute the workspace size, note the number of entries in `x_i` and `y_i` is `degree^ndims`, where `degree` is the degree of the Legendre basis polynomials (`-d` argument at the command line) and `ndims` is the number of dimensions in the PDE being solved. Each matrix `A_i` in the Kronecker product is square with `degree` rows. So, the size of each intermediate product is also `degree^ndims`. We need two such workspaces (input/output) for each kronmult, so we allocate `num_kronmults * degree^ndims` entries for each, i.e., `2 * num_kronmults * degree^ndims` in total.
Additional space is also required for building the pointer lists that are passed into `kronmult_cuda` to perform the arithmetic. These lists contain pointers to elements’ `A_i` matrices, as well as `x_i` and `y_i` for each element, along with the workspace region for each element.
Adaptivity may increase/decrease the number of elements in the element grid (and thus the number of kronmults we need to compute). This event will trigger reallocation. Reallocation will not occur if the number of elements assigned to a rank stays the same or decreases. Because workspaces are allocated on GPU (slow), we avoid reallocation when possible.
The newest kronmult library handles decomposition and allocation internally; these parts of the `kronmult` component should eventually be stripped out. Performance regression testing should be used to ensure the new allocation scheme is performant relative to the current implementation.
Optimizations could be made to the pointer lists as well. Given that we already have an ordering of elements, we could just use the element number and offsets from `x` and `y` to compute an element’s `x_i`, `y_i`, and workspace position.
This short-sightedly named `kronmult_cuda` component (it should be `kronmult_device`) contains the three functions that prepare for and call into the kronmult library (https://github.com/project-asgard/kronmult) to concurrently perform some number of kronmults (see the `kronmult` component). These functions will run on GPU if `ASGARD_USE_CUDA` is defined at build time, with fallback to OpenMP if it is not. So this component, along with the kronmult library, is the computational core of the code for explicit timestepping, and for iterative implicit stepping (once implemented).
- function `stage_inputs_kronmult`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/device/kronmult_cuda.hpp#L4
Each kronmult uses one input workspace and one output workspace (see the `kronmult` section). This function copies `x_i` into the workspace for each kronmult to provide the initial input for the kronmult’s matrix multiplication chain.
- function `prepare_kronmult`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/device/kronmult_cuda.hpp#L8
Builds pointer lists for each kronmult. These pointer lists include the positions of operator submatrices and `x_i`/`y_i` segments for each kronmult. The kronmult library uses this information to read/store data in the expected place for each kronmult.
- function `call_kronmult`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/device/kronmult_cuda.hpp#L18
Sends the pointer lists built by the prior routine to the kronmult library for computation. If built with CUDA, ASGarD utilizes one thread block for each kronmult.
These kernels were profiled and optimized at the 2020 NERSC Hackathon, but additional optimization is possible.
Additionally, if we coordinated the ordering of elements between the kronmult library and the `kronmult_cuda` component, we could use base pointer + offset techniques to reduce pointer list sizes.
Despite being added after our component namespace policy was implemented, this component is frustratingly not in its own namespace (unambiguously Tyler’s fault).
Finally, we could “hipify” these kernels. Along with a “hipified” kronmult library and some other minor changes (e.g., warp/thread block sizing), we could then run ASGarD (explicit stepping) on AMD GPUs. In this case, we could generalize the component name to `kronmult_device` or similar.
This simple component provides templated precision/device execution wrappers for BLAS functions, which are primarily invoked by the `tensors` and `fast_math` components. Wrapper signatures mimic the BLAS API, with the precision identifier (e.g., s or d) dropped. Currently, we do not support complex types, as there is not yet any complex arithmetic in ASGarD. The component does not provide a complete BLAS; functions are added as needed by 1) adding an `extern "C"`-qualified declaration for the BLAS function to be added and 2) adding a wrapper templated on precision that invokes BLAS/CUBLAS functions.
Supporting non-NVIDIA devices will require extension of this component (as well as `kronmult_cuda` and `tensors`).
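A minimal sketch of the precision-dispatch pattern, assuming only a standard Fortran BLAS (`saxpy_`/`daxpy_`); ASGarD's actual wrappers also route to CUBLAS and cover many more functions:

```cpp
#include <type_traits>

extern "C" void daxpy_(int const *n, double const *alpha, double const *x,
                       int const *incx, double *y, int const *incy);
extern "C" void saxpy_(int const *n, float const *alpha, float const *x,
                       int const *incx, float *y, int const *incy);

// y = alpha*x + y, dispatched to the correct BLAS symbol by precision.
template<typename P>
void axpy(int n, P alpha, P const *x, int incx, P *y, int incy)
{
  static_assert(std::is_same_v<P, double> || std::is_same_v<P, float>,
                "only float/double supported");
  if constexpr (std::is_same_v<P, double>)
    daxpy_(&n, &alpha, x, &incx, y, &incy);
  else
    saxpy_(&n, &alpha, x, &incx, y, &incy);
}
```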
- `gemm`/`gemv`/etc. wrapper functions, e.g.: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/lib_dispatch.hpp#L135
Invoke BLAS/CUBLAS routines with the correct precision.
- struct `device_handler`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/lib_dispatch.cpp#L33
This statically initialized struct prepares the CUBLAS library when ASGarD is built for GPU execution (`ASGARD_USE_CUDA`).
We will likely need additional BLAS functions - these can be added easily by supplying more wrapper functions with tests.
Supporting non-NVIDIA devices will require modifying the `device_handler` and adding additional arguments/paths to the wrappers.
The project's mathematicians and physicists develop the methods/mathematical techniques for ASGarD in a separate MATLAB implementation: https://github.com/project-asgard/DG-SparseGrid. New features are generally added to ASGarD after math/physics professionals first implement a MATLAB "template". Then, test data (inputs/outputs) are generated from the MATLAB implementation and stored in `[ASGarD root]/testing/generated-inputs/[component-name]`. Finally, a C++ version of the feature is implemented based on the MATLAB "template", with modifications made for performance and design consistency, and tests are added to ensure the C++ feature faithfully replicates the functionality of the MATLAB template.
Because of this MATLAB-to-C++ development process, it is helpful to implement some MATLAB built-ins in C++. These functions are placed in the `matlab_utilities` component. Note that given the differences between MATLAB and C++, the C++ functions may not exactly match the MATLAB API. Additionally, testing against MATLAB requires functions to read MATLAB-written vectors/matrices from disk; these functions are also housed in this component.
This component is primarily used in the routines that initially set up the problem, before entering the loop that advances time in the code.
- `read_[scalar/vector/matrix]_from_txt_file` functions: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/matlab_utilities.hpp#L123
These functions are designed to read data written by functions in the MATLAB repository (specifically, the `write_octave_like_output` function) for testing.
- `find` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/matlab_utilities.hpp#L79
Returns the indices in an `fk::vector`/`fk::matrix` for which the supplied predicate returns true.
- `eye` function
Returns an identity matrix with the given size.
- `linspace` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/matlab_utilities.hpp#L56
Returns a vector with N linearly spaced elements between the provided start and end values.
- `polyval` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/matlab_utilities.hpp#L64
Evaluates a polynomial. Specifically, yields `y = p(0)*x^n + p(1)*x^(n-1) + ... + p(n-1)*x + p(n)`, where `p` is a vector of length `n+1` whose elements are the coefficients of the polynomial in descending power order. `x` can also be vector-valued; in this case, a vector `y` is returned with the polynomial evaluated at each element of `x`.
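A standalone C++ analogue of the `polyval` behavior just described, using Horner's method (a hypothetical free function, not ASGarD's exact signature):

```cpp
#include <vector>

// p holds coefficients in descending power order: p[0]*x^n + ... + p[n].
double polyval(std::vector<double> const &p, double x)
{
  double y = 0.0;
  for (double coeff : p)
    y = y * x + coeff; // Horner's method
  return y;
}

// Vector-valued x: evaluate the polynomial at each element.
std::vector<double> polyval(std::vector<double> const &p,
                            std::vector<double> const &xs)
{
  std::vector<double> ys;
  ys.reserve(xs.size());
  for (double x : xs)
    ys.push_back(polyval(p, x));
  return ys;
}
```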
This is an older component implemented before the component-namespace policy. These functions should be placed in one namespace.
The `PDE<P>` class is one of the foundational abstractions in the code, representing the problem we seek to solve with ASGarD. `PDE<P>` is defined in `pde/pde_base.hpp`; this component merely provides a factory routine to instantiate a `PDE<P>` given an enum type argument.
- `make_PDE` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/pde.hpp#L45
Given a `PDE_opts` argument (defined in the `program_options` component), constructs and returns a `PDE<P>` (defined in `pde/pde_base.hpp`). Because each specific PDE solved by ASGarD is modeled by a class derived from the base `PDE<P>` class, the factory returns a `unique_ptr<PDE<P>>` (see https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rc-factory).
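A minimal sketch of the enum-dispatched factory pattern described above. The enum values and bare constructors are illustrative; the real `make_PDE` also forwards user options to the derived PDE's constructor:

```cpp
#include <memory>
#include <stdexcept>

enum class PDE_opts { continuity_1, continuity_2 }; // illustrative subset

template<typename P>
struct PDE { virtual ~PDE() = default; };
template<typename P>
struct PDE_continuity_1 : PDE<P> {};
template<typename P>
struct PDE_continuity_2 : PDE<P> {};

// Factory: one switch case per supported problem.
template<typename P>
std::unique_ptr<PDE<P>> make_PDE(PDE_opts choice)
{
  switch (choice)
  {
  case PDE_opts::continuity_1:
    return std::make_unique<PDE_continuity_1<P>>();
  case PDE_opts::continuity_2:
    return std::make_unique<PDE_continuity_2<P>>();
  default:
    throw std::runtime_error("unrecognized PDE choice");
  }
}
```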
The `PDE<P>` class is overly complex, and adding a specific PDE to the code requires lots of steps: creating a class derived from `PDE<P>`, adding an enum value (`PDE_opts`) in `program_options`, adding a switch case in the factory method in this component, and adding tests in many separate test files (`boundary_conditions`, `pde`, `time_advance`, `coefficients`, ...). After overhauling the way PDEs are specified, this constructor may need modification.
This component is also not isolated within its own namespace, as the newer components are.
The `PDE<P>` class defined in this component represents the interface for the problem ASGarD will solve. Derived classes (also found in the `pde/` directory) model specific problems.
A PDE with `ndims` dimensions is primarily composed of:
- `ndims`-many `dimension` objects, each of which defines the number of hierarchical basis levels and the domain for one dimension.
- An arbitrary number of functions to define source vectors. These are added to the solution vector when time is advanced.
- One or more `term_set` objects. Each `term_set` must define dimension-many single dimensional `term`s, each with its own coefficient matrix (see `coefficients`, `kronmult`).
- Each `term` has one or more `partial_term` objects. These partial terms each define a coefficient matrix. Each `term`'s coefficient matrix is formed by multiplying together its `partial_term` coefficient matrices.
- `PDE<P>` class.
This class is the interface for problems solved with ASGarD. `dimension`, `term`, and other objects defined in this component are building blocks for `PDE<P>`. The constructor is invoked once during problem setup; its arguments are passed from values statically defined in the specific (derived) PDE in the `pde/` directory requested by the user (`-p` argument). The single dimensional coefficient matrices that together conceptually represent the system matrix `A` are stored in this class, and can be retrieved with the `get_coefficients` function.
Of all the components in ASGarD, this one needs the most work. Ideally, we would like to specify PDEs in a more natural way, as (potentially non-expert) users will need to perform this task to solve custom problems. A DSL for representing PDEs could be one approach; this could be especially useful if DSL files could be read by both MATLAB and C++, since the process of porting PDEs between languages is fraught with opportunities to make mistakes.
The primary challenge in "fixing" the way PDEs are represented in ASGarD is specifying, storing, and passing around functions. Statically defining function sets, as we do now, works. But, this is far from the cleanest/most efficient way to add problems to the code.
This component provides the level component of the level and cell coordinates for all elements that will be included in the initial element grid ("active" elements). Each element in the grid must have `ndims*2` coordinates - a level and a cell coordinate for each dimension (see the `elements` component). The `elements::table` constructor uses `permutations` functions to generate a list of level coordinate sets to be included in the grid. Each set of coordinates in the list will add multiple elements to the grid - the `elements` component determines all valid cell coordinates (see `elements::table::get_cell_index_set`) for each level coordinate set.
- `get_max_multi` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/permutations.hpp#L46
Builds all valid level coordinate sets (full grid).
- `get_lequal_multi` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/permutations.hpp#L39
Uses a sparsity rule to filter valid level coordinates (sparse grid). The currently implemented rule excludes level coordinate sets where the sum of level coordinates across dimensions exceeds the maximum initial level (`-l` argument to ASGarD).
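For illustration, the "sum of levels at most `max_level`" rule can be enumerated recursively, as sketched below with hypothetical names (see `get_lequal_multi` for the real implementation):

```cpp
#include <vector>

// Append to out every ndims-long level tuple whose components sum to
// at most max_level.
void enumerate(int ndims, int max_level, std::vector<int> &current,
               std::vector<std::vector<int>> &out)
{
  if (static_cast<int>(current.size()) == ndims)
  {
    out.push_back(current);
    return;
  }
  int used = 0;
  for (int l : current)
    used += l;
  for (int l = 0; l + used <= max_level; ++l)
  {
    current.push_back(l);
    enumerate(ndims, max_level, current, out);
    current.pop_back();
  }
}

// All level coordinate sets for an ndims-dimensional sparse grid.
std::vector<std::vector<int>> get_level_sets(int ndims, int max_level)
{
  std::vector<std::vector<int>> out;
  std::vector<int> current;
  enumerate(ndims, max_level, current, out);
  return out;
}
```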
This component serves its limited role well. It is an almost direct port of code in the MATLAB implementation.
This component exists to parse command line input, and to store runtime options that are not PDE-related (those options are stored in the `PDE<P>` class). A list of command line options can be printed to the console using `./asgard --help`. The PDEs (specific problems) that ASGarD can solve can be listed using `./asgard --available-pdes`.
- `parser` class
This class accepts command line input via its constructor, which parses arguments using the Clara header-only library. Getter functions are provided for retrieving user values. The `is_valid` function returns true when user-provided values fall within acceptable ranges.
- `options` class: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/program_options.hpp#L281
The `parser` is used to construct both the `PDE<P>` object that represents the problem being solved and the `options` object that stores other user arguments. The `options` object contains 1) public `const` member variables where user-provided arguments are stored and 2) simple helpers that return whether or not ASGarD should write wavelet or realspace output at a given timestep.
The `parser` object communicates when user-provided values fall outside acceptable ranges (e.g., a negative number of hierarchical basis levels) by returning false from the `is_valid` function. There may be a better way to communicate when user-provided values are invalid, but any solution should facilitate unit testing (e.g., it cannot simply call `exit`). The unit testing framework used by ASGarD (Catch) may provide functions that expect exceptions to be thrown for testing, but we have generally avoided exceptions.
The Clara library is no longer maintained. There may be a maintained fork; we have not searched for one yet.
The `coefficients` component requires evaluating Legendre polynomials (or their derivatives) at certain values (math folks, please add details!).
- `legendre` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/quadrature.hpp#L15
Evaluates `degree`-many Legendre functions and their derivatives for the input domain. The function returns a `std::array` of `fk::matrices`. Element zero has a row for each value in the input domain; each of the `degree`-many columns represents a Legendre polynomial evaluated at that value. Element one is structured similarly, except that the first derivative of each polynomial is evaluated. See the recurrence sketch after this list.
- `legendre_weights` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/quadrature.hpp#L26
(Taken from MATLAB. I have no idea. Math folks, help!) Computes the Legendre-Gauss nodes (roots on x) and weights on an interval [`lower_bound`, `upper_bound`] for a Legendre polynomial of degree `num_points`. These are later used for computing definite integrals using Legendre-Gauss quadrature.
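For the `legendre` evaluation above, the values and first derivatives at a point follow from Bonnet's recurrence. A hedged, standalone sketch (ASGarD's function additionally normalizes for the basis and fills `fk::matrix` outputs over a whole input domain):

```cpp
#include <vector>

// Evaluate P_0..P_{degree-1} and their first derivatives at x using
//   (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x)
//   P'_{n+1}(x)      = (n+1) P_n(x) + x P'_n(x)
void legendre_at(double x, int degree, std::vector<double> &p,
                 std::vector<double> &dp)
{
  p.assign(degree, 0.0);
  dp.assign(degree, 0.0);
  if (degree > 0) { p[0] = 1.0; dp[0] = 0.0; }
  if (degree > 1) { p[1] = x;   dp[1] = 1.0; }
  for (int n = 1; n + 1 < degree; ++n)
  {
    p[n + 1]  = ((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1);
    dp[n + 1] = (n + 1) * p[n] + x * dp[n];
  }
}
```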
The standard library may contain overlapping functionality; we may be able to substitute some of these functions to reduce our maintenance/testing burden (https://en.cppreference.com/w/cpp/numeric/special_functions/legendre).
These functions should be placed within a `quadrature` namespace.
Implicit timestepping methods in ASGarD are driven by a direct dense solve `y = (I - dt*A) \ x`, where `A` is the system matrix and `x` is the current state vector (see `time_advance`). Currently, we explicitly form `A` for this purpose when implicit timestepping is chosen by the user (`-i` argument to ASGarD). However, we have not yet implemented a distributed direct solve, so implicit timestepping is limited to a single node. Also, the math professionals on the project are exploring other techniques for this solve requirement, including iterative methods like GMRES. The `solver` component represents our (limited) exploration of iterative methods in the C++ code - a single, simple GMRES function that locally performs GMRES for an explicitly formed `A`. However, since `A` is very large and we already have a technique for distributed `A*x` without forming `A` in the `kronmult` code, we could eventually implement distributed, matrix-free GMRES atop `kronmult`.
The solve technique used in implicit stepping can be selected with the `-s` argument to ASGarD. See `./asgard --help`.
- function `simple_gmres`: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/solver.hpp#L8
Given a system matrix `A`, right-hand side `b`, an initial guess for `x`, and some tolerance, approximates the solution to `x = A\b` with residual less than `abs(tolerance)`. A preconditioner `M` can optionally be provided.
This is a simple first step toward iterative methods in ASGarD. As noted in the overview, we could eventually implement a distributed GMRES driver here that calls into the `kronmult` code for `A*x`.
The `tensors` component houses the `fk::matrix` and `fk::vector` classes used throughout the code, with support for linear algebra operations through both member functions/operators and the `fast_math` component.
Both `fk::vector` and `fk::matrix` can represent owning references or non-owning views, as well as CPU- or GPU-resident data (see View Semantics: https://github.com/project-asgard/asgard/wiki/view-semantics). In addition to a template parameter describing a tensor's floating point precision, non-type template parameters and SFINAE are used to restrict the API and signal to other components when operations should take place on CPU or GPU. A tensor's "memory type" template parameter indicates whether it is owning or non-owning, and its "resource type" parameter indicates whether it is backed by CPU or GPU memory. For instance, the `const_view` memory type indicates a non-owning, read-only view of an underlying owning tensor, and the `resize` member function is disabled by SFINAE for this type. Many member functions that require direct random access are disabled for `device` resource type tensors (those on GPU), but not `host` tensors.
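A minimal sketch of that non-type-template-parameter + SFINAE pattern, with hypothetical types rather than ASGarD's actual declarations:

```cpp
#include <type_traits>
#include <vector>

enum class mem_type { owner, const_view };
enum class resource { host, device };

template<typename P, mem_type mem = mem_type::owner,
         resource res = resource::host>
class vector
{
public:
  // resize participates in overload resolution only for host-side owners;
  // calling it on a const_view or device vector fails to compile.
  template<mem_type m = mem, resource r = res,
           typename = std::enable_if_t<m == mem_type::owner &&
                                       r == resource::host>>
  void resize(int new_size)
  {
    data_.resize(new_size);
  }

private:
  std::vector<P> data_;
};
```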
- `vector` class: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tensors.hpp#L97
This extensive class represents a linear algebra vector. No distinction is made between row and column vectors. Because vectors represent a region of contiguous memory, both `vector` and `matrix` views can be constructed from `vector` owners; so, `vector`s are a means of pre-allocating some amount of flexible linear algebra memory. Member functions include various constructors for direct/copy/move construction of owners and views on CPU or GPU, as well as precision conversion/truncation "copy constructors" that translate between vectors of differing floating point precision. `transfer_from` and `clone_onto_[host/device]` functions provide a means of transferring vectors to/from GPU. Math operators, including `+` and `*`, are defined to simplify arithmetic operations; note that `fast_math` provides a more flexible API for these operations. Finally, utility functions, including `print`, are useful for debugging.
- `matrix` class: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tensors.hpp#L349
Like `vector`, the `matrix` class contains constructors/destructor, precision conversion routines, arithmetic operations, and utility routines (including `print`).
- functions for handling device memory: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tensors.hpp#L640
These functions use conditional compilation to allocate and manage `device` tensors on GPU when ASGarD is built for GPU execution (`ASGARD_USE_CUDA`), with fallback to CPU otherwise.
The tensors component uses the `fk` ("fully kinetic") namespace primarily to disambiguate our vector class from `std::vector`. Originally, "fully kinetic" was part of the code's working name. Because of the ubiquity of linear algebra in the code, replacing `fk` with the component name `tensors` might be unnecessarily verbose, but a more descriptive namespace might be more appropriate.
The complexity in this component makes modification (or even understanding its API) difficult, and the unit tests are extensive. However, simplifying tensors may require splitting their responsibilities between separate objects; `tensors` currently includes both owning and non-owning references to matrices and vectors located on CPU or GPU, with SFINAE used to restrict the API as appropriate. Using concepts rather than SFINAE for this purpose could also enhance readability.
Given the current problem state, this component solves for the `t+1` state vector. Explicit time advance is performed using a third-order Runge-Kutta method (`explicit_advance`). The backward Euler technique is used for implicit stepping (`implicit_advance`). Other time stepping methods could be used, and many are implemented in the MATLAB version of the code.
Because problems must be completely set up before `time_advance` functions run, and these functions generate essentially the final answer (in wavelet space) for the code, `time_advance` unit tests function similarly to integration tests, in that they exercise nearly all of the code. Ensuring ASGarD `time_advance` output matches MATLAB output in tests increases our confidence in the end-to-end correctness of ASGarD.
- `adaptive_advance` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/time_advance.hpp#L20
This is the driver routine for advancing time in the code. `explicit_` or `implicit_advance` are invoked from this function. ASGarD is capable of adapting the grid of active elements (see the `elements` and `adapt` components). When adaptivity is enabled (`--do_adapt`), this function first coarsens the grid (removes elements with low-magnitude coefficients), then takes probing "fake" timesteps to determine where refinement (adding child elements for high-magnitude element coefficients) is needed.
- `explicit_advance` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/time_advance.hpp#L32
Advances time with a third-order Runge-Kutta method driven by `A*x`, where `A` is the tensor-encoded system matrix and `x` is the current state vector (see `kronmult`).
- `implicit_advance` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/time_advance.hpp#L42
Advances time with the backward Euler technique driven by `(I-dt*A)\x`. Currently, ASGarD supports a local direct dense solve and a local GMRES solve.
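For reference, the backward Euler update treats the right-hand side implicitly, which is where the `(I-dt*A)\x` solve above comes from:

$$
x_{n+1} = x_n + \Delta t \, A \, x_{n+1}
\quad\Longrightarrow\quad
(I - \Delta t \, A)\, x_{n+1} = x_n .
$$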
Significant time investment has been made to optimize explicit time advance, which can be run in the distributed setting as well as locally.
Much less work has been done for implicit time advance. We would like to explore additional approaches for distributing the implicit advance (see the `solver` component).
This component is meant to house auxiliary development code. Currently, this includes a simple timer for measuring performance and a layer over the `assert` macro to avoid unused variable warnings when asserts are disabled (i.e., `CMAKE_BUILD_TYPE=Release`).
- `simple_timer` class: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tools.hpp#L23
The `start` and `stop` routines can wrap execution of a certain function invocation/code block with timer start/stop to measure performance. An identifier string argument is used to pair `start`s with `stop`s. The `report` function then generates a table that shows, for each identifier, avg/med/high/low timings. `report` is currently invoked at the end of `main`.
- `expect` macro: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/tools.hpp#L16
Executes a no-op (void cast) on the provided value when asserts are disabled, to prevent unused variable warnings (variables are sometimes set up only for asserts). Translates to `assert` otherwise.
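A sketch of the pattern just described, with a hypothetical macro body (see `src/tools.hpp` for ASGarD's actual definition):

```cpp
#include <cassert>

#ifdef NDEBUG
// asserts disabled: void-cast the condition to silence unused-variable
// warnings without performing any runtime check
#define expect(condition) ((void)(condition))
#else
#define expect(condition) assert(condition)
#endif
```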
ASGarD has built-in profiling options (see https://github.com/project-asgard/asgard/wiki/profiling). The `simple_timer` complements these options with a low cost, high-level summary of code performance. Built-in profiling and the `simple_timer` report could be used as building blocks in automated performance regression testing.
ASGarD solves PDEs in wavelet space. However, PDEs are specified in real space, and the solution produced by ASGarD must eventually be converted to real space for analysis. The `transformations` component (along with the `basis` component) provides routines to convert between wavelet and real space. (Math folks - more info in this overview and in the function descriptions, please!)
The transform-to-realspace functionality provided by this component (`gen_realspace_transform`/`wavelet_to_realspace`) consists of initial implementations that only work locally (i.e., not when multiple ranks are used) and may not be scalable enough for large problems.
- `gen_realspace_transform` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/transformations.hpp#L23
Generates the matrices needed for wavelet to real space conversion.
- `wavelet_to_realspace` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/transformations.hpp#L28
Given a wavelet space vector and the matrices generated by `gen_realspace_transform`, converts the wavelet space vector to real space and stores it in the in/out real space vector argument.
- `transform_and_combine_dimensions` function: https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/transformations.hpp#L142
Combines single dimensional real space function arguments (`v_functions`) using Kronecker products and transforms the resulting multi-dimensional vector to wavelet space. (Math folks - is this right?) Requires a `wavelet_transform` object from the `basis` component.
This component should be placed in its own namespace.
The real space transform process should be re-designed (strange interface and awkward implementation). Additionally, the current algorithm does not scale well - the math/physics professionals are investigating different approaches.