Skip to content

Component Overview

Tyler McDaniel edited this page Feb 25, 2021 · 29 revisions

Component Overview/Notes (as of v0.4.0)

These notes are meant to provide a brief summary of ASGarD’s major code components. The goal in writing these is to document the code, but also to provide a place where math/physics professionals can add their insight to improve the team’s overall understanding of how the code works. So, this is not a closed writeup – please use your expertise to add detail and fix misconceptions.

adapt

overview

“Adaptivity” describes ASGarD’s capability to add/remove element grid points as needed to more efficiently and accurately model a solution. Elements responsible for relatively large coefficients in the solution may be “refined” (have child elements added to the grid). Elements whose coefficients are very small may be “coarsened” (removed from the grid). The thresholds for coarsening and refinement are controlled by an input argument to ASGarD (--thresh).

primary classes/functions

  • class distributed_grid

The distributed_grid is composed of the element table (from elements component) and the distribution_plan (distribution component) that maps each MPI rank in the simulation to the elements for which the rank must compute coefficients. In short, this class bundles information about the elements and how they are distributed, and provides functions to adapt the grid given the solution vector with elements’ coefficient values.

The main routines driving this functionality (distributed_grid::refine and distributed_grid::coarsen) refine/coarsen the element grid by adding/deleting elements from its member elements::table, and then redistributing responsibility for the new element grid across ranks using functions in the distribution component.

notes for improvement

  • It is reasonable to encapsulate the element table with the distribution plan, since the adaptivity functions require altering the grid and rebalancing the problem across ranks, but providing const & getters for these members is not ideal. It would be better to provide a stricter interface allowing outside components to access the functionality they need without exposing (for read) the objects themselves.

  • Currently, there is no limit (beyond a level cap) on refinement/coarsening. This introduces two related problems; refinement to the extent that some rank’s portion of the solution vector spills out of RAM (virtually unlimited growth) and coarsening to the extent that fewer elements remain than ranks. We will need to gracefully handle both situations.

basis

overview

Most computations within ASGarD are performed in wavelet space, but the PDEs we want to solve are defined in real space. So, we require a means of transforming vectors/matrices into wavelet space. Most of this functionality is provided in the transformations component, but the basis component handles the operator required for transformations. Originally, this operator was explicitly formed (as a matrix), with application via matrix multiplication as appropriate to transform values. However, because the transform matrix is sparse with large zero regions and a regular pattern of dense blocks, we now form and store only the dense blocks. Functions are provided in the basis component to apply the transform matrix using these blocks, skipping the unnecessary arithmetic large zero regions would introduce.

primary classes/functions

  • function operator_two_scale

Deprecated. Explicitly forms the transform matrix. Uses generate_multi_wavelets, which is conceptually private but exposed in the interface for testing purposes.

  • class wavelet_transform

Constructor forms the dense blocks of the transform matrix. apply methods provide the result of matrix multiplication using these dense blocks.

notes for improvement

The dense blocks required for the transform matrix are relatively small and their formation is not costly. So, rather than constructing one instance of this class and passing it around for transforms as needed, we could create instances as needed at function scope to declutter APIs requiring wavelet_transform parameters.

Both the deprecated operator_two_scale and generate_multi_wavelets functions are lengthy and obscure in places. The C++ developers are not necessarily familiar with the underlying math performed by these functions; this makes optimization or clarification of the MATLAB code that served as a template for these routines a challenge. There is an opportunity for someone with math insight to clarify the code here.

batch

overview

“Kronmult” (Kronecker product of matrices followed by matrix-vector multiplication , Ax=b, where A is a tensor product) is one of the core computational kernels in ASGarD. We often avoid explicit formation of the Kronecker product by instead multiplying each Kronecker operand (in turn) directly to the vector x conceptually reshaped as a small matrix. High dimensional/high resolution problems may require O(10^6) or more small kronmults to move the simulation forward at each timestep.

A prior implementation of the time stepping code performed these many kronmults by constructing dimension-many lists (batches) of matrix-multiplication operands and invoking vendor-supplied batched_gemm routines. The batch component was designed to support constructing, storing, and passing pointer lists for batched_gemm.

The batch component is no longer used to drive explicit timestepping in time_advance; the kronmult component now serves this purpose. However, some functions within the component are used for implicit timestepping, and the “kronmult as batched gemm” technique is employed for wavelet to realspace transforms in the transformations component.

primary classes/functions

  • class batch

Provides storage for fixed-size list of pointers to same-sized matrices for batched_gemm. assign_entry interface stores pointer to fk::matrix with asserts/size checking; assign_raw stores a pointer without such checks. assign_raw was added after performance measurements showed that creating/destroying many matrix views can become costly (mostly due to reference counting using smart pointers in fk::matrix class).

  • class batch_chain

Given the operands for a kronmult (a list of matrices and a vector), as well as some externally allocated temporary workspace, the batch_chain class internally constructs batch members for batched_gemm. The execute member actually performs the multiplications by calling into the lib_dispatch component. Currently, only used for real space transform.

  • function batched_gemm / batched_gemv

Thin layer over lib_dispatch functions with the same name; lib_dispatch accepts BLAS-style arguments, so these functions translate/setup for those calls.

  • function build_system_matrix

The implicit time advance is driven by solving the system matrix. Given the elements::table describing which elements are active, as well as the 1D coefficient matrices from the PDE class, this function explicitly constructs the system matrix.

notes for improvement

This entire component is likely to be removed in the future. build_system_matrix belongs in the time_advance component or elsewhere; the batch_chain/batch classes are only used for the wavelet to real space transforms in transformations. The current implementation for this transform is a proof of concept that does not scale well and is not distributed across nodes.

boundary_conditions

overview

Some problems require the application (by addition) of boundary condition vectors to the solution vector at every timestep. This component a) generates unscaled boundary condition vectors and b) provides functions that scale the boundary condition vectors for the current time. The mathematical/physical significance of these vectors has not generally been understood on the C++ development side, but the above description demonstrates the most salient features of these vectors for C++ development: boundary condition vectors are tied to the current problem and grid (and so must be regenerated after adapting the grid), and must be scaled for time before addition to the solution vector. These features are shared with source vectors, although the implementations are different.

primary classes/functions

  • function make_unscaled_bc_parts

Given a PDE and grid, returns a pair of “unscaled_bc_parts” representing the left and right unscaled boundary condition vectors. The start and stop element integer parameters are used to generate only part of the vectors; these parameters are used to generate only the part of the vector assigned to a rank in the distribution plan.

  • function generate_scaled_bc

Accepts the unscaled boundary conditions and current time as input, and scales the vectors appropriately.

notes for improvement

This component has a strange interface – an unscaled boundary condition vector is represented by a vector of vectors of vectors of fk::vectors. This three level nesting is the result of each partial term in each term in each dimension contributing to the solution. However, we already have that exact structure in the PDE class – the PDE contains dimensions and terms that contain partial terms. Rather than existing in an isolated component, each partial term should be responsible for producing its own unscaled part, and the PDE should be responsible for providing the boundary condition vector. Housing source/boundary condition vector storage and generation in the PDE class makes sense logically, would declutter the API, and provide an opportunity to refactor these functions (which could stand a rewrite). The creation and application of source vectors is somewhat similar, and should also probably be consolidated in the PDE class.

coefficients

overview

Each PDE defines dimension-many coefficient matrices (sometimes called operator matrices”) for each partial term. These partial term matrices are chained together by matrix multiplication to form a single coefficient matrix for each term and dimension. These matrices are square, dense, and 2^level * degree in size.

Information from elements::table is used to draw values from these matrices for either 1) building the system matrix for implicit time advance or 2) performing system matrix/state vector multiplication for explicit time advance (or, in future, implicit time advance driven by iterative solve). Specifically, each element in the table contains a series of coordinates that designate, for each dimension, where that element should index into the coefficient matrices.

Currently, all coefficients are stored from construction onward on the GPU (if ASGarD is built to use GPUs). Only time-independent coefficients are supported at present, but we expect to support time-dependent coefficients as well.

primary classes/functions

  • function generate_all_coefficients

Takes an input/output PDE<P> argument and fills its coefficient matrices by generating all partial term coefficient matrices for each and chaining them together by multiplication.

  • function generate_coefficients

Generates one partial term coefficient matrix. Invoked from generate_all_coefficients.

notes for improvement

A PDE’s coefficients (at least, the time-independent ones) should be constructed when the PDE is built (RAII). There is no great reason this is not already done. Aside from that, matrix coefficient generation and update functions could easily (and probably should) be member functions of the PDE class. This rework could take place when the planned overhaul of the PDE class occurs.

distribution

overview

The distribution component handles communication between MPI ranks when ASGarD is run with multiple ranks (nodes/GPUs). ASGarD expects one rank per GPU when built to use GPUs. To simplify problem decomposition, we also currently require an even or perfect square number of ranks. If an odd/nonsquare number of ranks is used, ASGarD will detect this condition and “round down” to the nearest valid number of ranks. Note ASGarD’s distribution only handles the explicit time advance case for now; running implicit time advance with multiple ranks will produce an error message and early exit.

ASGarD decomposes the problem across ranks by decomposing the element grid (see elements section), since each element connection contains the information needed to produce a partial solution. We divide the grid into rectangular subgrids as “squarely” as possible given the number of ranks ASGarD is run with (see get_num_subgrid_cols function). We tile these squares in row major order over the element grid, assigning the element connections within each tile to a single rank. The last subgrid in each tile row may be smaller than the others, as the number of elements may not divide the tile size.

The current implementation will produce tiles containing at least as many tile rows as tile columns (i.e., there will be at least as many vertical lines as there are horizontal lines dividing the elment grid). All subgrids within a tile row will write (additively) to coefficients in the same region of y. So, each tile row must update y atomically (current approach) or write separately and reduce into y (previous approach). All subgrids within a tile column will read from the same segment of x.

Two major interprocess communication tasks are required by ASGarD and provided by this component – 1) decomposing/distributing the solution vector and 2) distributing changes to elements::table after element grid adaptation. The first task is accomplished by a balanced pairing of senders with receivers, where senders write to the region of y that receivers need for the next timestep x. Once this pairing is complete, point-to-point communication is used to exchange vector segments. The second task is accomplished by all-to-all communication to form consensus on the new element table, followed by a redistribution of x such that all ranks have the information they need for the next timestep. Task 1 code was implemented first, followed by task 2. Efforts were made to consolidate/share code between these two tasks, but some duplication exists as a result of dissatisfaction with task 1 implementation.

primary classes/functions

  • helper struct node_out

Allows for printing a message once for each node, so that output isn’t clobbered by every local rank’s output.

  • struct message

Simple data object representing an MPI point-to-point message that should be sent. Fields for direction (send/receive), and a range of elements whose coefficients should be transferred.

  • function get_plan

Generates a map from each rank number (0, .. , n-1 for n ranks) to a subgrid representing all the element connections assigned to that rank. Invokes get_subgrid for each rank.

task 1 functions
  • function reduce_results

Deprecated since we use atomic update into y now. Previously, each rank within a tile row would use this function to reduce y by gathering each row member’s contribution.

  • function exchange_results

Takes as input source (y), and dest (x_next) for each rank, as well as the distribution plan. Partners ranks and transfers data using MPI from senders source vector to receivers’ dest vector to ensure all ranks have correct x_next for the next timestep.

  • function gather_errors

Gathers RMSE figures from all ranks on node, for display after each timestep.

  • function gather_result

Gathers all coefficients in a single vector. Note we use a std::vector due to its 64 bit size field; our fk::vectors are currently 32-bit indexed.

task 2 functions
  • function get_global_max

Adaptivity threshold is based on max coefficient value; find maximum across all ranks.

  • function distribute_table_changes

Share changes to element table across ranks.

  • function redistribute_vector

After consolidating the new element table, redistribute x so that all ranks have coefficients (or zeros in the case of refined elements) for the next timestep. Old plan and new plan size should match; assumes elements are only appended to the list of active elements, or deleted with elements shifted left to fill deleted regions.

notes for improvement

  • The helper functions for task 1 are unnecessarily complex (over engineered). Lots of simplification needed in pairing/partnering code.

  • Not in a namespace like the newer components. Should be placed in a distribution:: namespace.

  • MPI performance has not been an emphasis for us, as we have focused on adding features/node local performance. Performance evaluation needed.

elements

overview

Each of the “elements” represented/managed by this component have three primary properties:

  1. an arbitrary but consistent index in the list of active elements. This determines where each element i reads from the system state vector x (x_i) and writes to the next state y (y_i).
  2. coordinates that determine where information from operators (coefficient matrices) should be retrieved to apply to x_i to yield y_i. Each element has dimension-many {level, cell} coordinates.
  3. a unique integral id used to identify elements.

We assume full element connectivity, so for each pair of elements, a connection exists that represents some piece of the work that must be done to form the system matrix (implicit time advance) or apply the system matrix to the state vector (Ax, explicit time advance) to advance time. The elements component contains helper functions for working with elements, as well as the elements::table object that manages a list of active elements, their ordering, and their coordinates.

In short, each element connection is a unit of work that must be done to advance time. The part of the system matrix the element connection requires to advance time is row-indexed by the element’s ordering, and column-indexed by the connected element’s ordering. Consequently, x_i is indexed by the connected element’s ordering, and y_i is indexed by the element’s ordering. Elements contain coordinates that locate submatrices within the single-dimensional coefficient matrix portions required to advance from x_i to y_i when the system matrix is not explicitly formed. See kronmult component.

primary classes/functions

  • class elements::table

Given a PDE and the number of hierarchical basis levels/basis degree, the elements::table constructor applies a sparsity rule to build a list of active elements. This class also includes functions to enable/disable elements when the element grid is adapted (see adaptivity component). get_child_elements function is provided to assist with refinement.

Element coordinates can be retrieved with the get_coords member function, but a flattened version of the element table (with all active coordinates stored contiguously in memory as ordered by the table) is also prepared to assist with on-device kronmult.

  • functions map_to_id and map_to_coords

Translate between an elements’ unique integral id and coordinates.

  • functions get_level_cell and get_1d_index

Translate between a {level, cell} coordinate and a unique id for that coordinate.

notes for improvement

This is one of the clearer components. One potential improvement is in the relationship between the elements::table and adapt::distributed_grid. The distributed_grid contains an element table member. This makes sense because the distributed_grid must add/remove elements from the table during adaptivity. However, the table can be retrieved with a simple getter. It would be better to provide a more limited interface.

fast_math

overview

This simple component primarily exists as a layer over lib_dispatch within the fm namespace. The main functions in this component accept vector/matrix class (see tensors component) arguments and invoke lib_dispatch to perform BLAS operations. The tensor class also has member functions that invoke BLAS (e.g., fk::matrix * fk::vector operator invokes gemv), but the fast_math API allows the user additional flexibility. For example, the fm::gemm function allows the user to specify a view or owning tensor argument C (the result), whereas the member functions will always allocate a new object for C (constraint of return dynamics).

Some utility math functions may also be placed here, when they do not correspond directly to MATLAB functionality – otherwise, those functions are placed in matlab_utilities.

primary classes/functions

  • BLAS-like functions (scal/gemv/gemm/nrm2/copy/axpy/etc.)

Perform BLAS operations on tensors.

notes for improvement

Could be extended as needed with more functionality – not all BLAS or common BLAS extensions are available here.

io

overview

This very small component wraps the HighFive header-only HDF5 library. Its only current use is writing the initial condition/solution vector to disk. To enable io, users must 1) build ASGarD with ASGARD_IO_HIGHFIVE defined, and 2) select a wavelet and/or realspace write frequency at runtime (see ./asgard –help).

primary classes/functions

  • initialize_output_file function

Initializes an HDF5 dataspace and writes the initial condition to disk.

  • update_output_file function

Writes the solution vector to disk.

notes for improvement

Implemented before component namespaces – functions should exist within an io namespace.

Physicists may require additional output information beyond the solution vector, e.g. the adapted grid for a given timestep. This component could be extended to write more information.

Additionally, not designed to work in the distributed case. In this case, every local rank will write over its fellow local rank filenames. There is also no provision for writing the final solution gathered from all ranks.

kronmult

overview

This component provides an interface to kronmult_cuda, where the functions that handle the core computational kernel in explicit timestepping (and in future iterative implicit timestepping) – conceptually, y = Ax, where A is the system matrix representing PDE operators, x is the current state vector and y is the next state vector. For these routines, the tensor-encoded A is not explicitly formed. Instead, for each element connection, element coordinates are used to retrieve small, dense submatrices {A_0, … , A_ndims-1} from each single-dimensional operator matrix (tensor “factor” of A). For each element connection (and each term) we compute a “kronmult” (Kronecker product * vector): y_i = (A_0 kron A_1 kron … A_ndims-1) * x_i. We do not explicitly form the Kronecker product used to compute y_i; the artihmetic is reformulated as a series of small matrix-matrix multiplications.

Kronmults are not actually performed by this component - kronmult_cuda handles that – but this component provides the necessary setup routines for problem decomposition and memory allocation.

primary classes/functions

https://github.com/project-asgard/asgard/blob/d43b3bdaf64db8385d5035f558acad7683ad59cf/src/kronmult.hpp#L19

notes for improvement

Clone this wiki locally