Issues in MPI PETSc Communicator implementation #16

Closed
amartinhuertas opened this issue Jun 15, 2020 · 0 comments

This thread enumerates the most important issues/limitations that I (@amartinhuertas) have found along the way in the development carried out so far in the mpi_petsc_communicator branch. The list corresponds to the current status of this branch, as reflected in commit ID f595e3e. I propose to go over them in a meeting with @fverdugo and @santiagobadia.

MPIPETScDistributedIndexSet/MPIPETScDistributedVector

The current implementation of MPIPETScDistributedVector is subject to three main limitations:

  1. Its eltype (let's call it T) must satisfy T<:Union{Number,AbstractVector{<:Number}}.
  2. T must be such that sizeof(T) == sizeof(Float64) (if T<:Number) or sizeof(eltype(T)) == sizeof(Float64) (if T<:AbstractVector{<:Number}).
  3. [UPDATE: solved in c57f212] If T<:AbstractVector{<:Number}, the length of all entries of the MPIPETScDistributedVector must match.

(1) and (2) arise because we implement MPIPETScDistributedVector using the services provided by the ghosted variant of the parallel distributed-memory PETSc vector PETSc.Vec{Float64}, which is restricted to Float64 entries. We could consider primitive element types with a size different from that of Float64 (say Int32 or Float16), but I still have to think about how to handle the case in which the size in bytes of the local part of the vector is not a multiple of 8 (i.e., the size in bytes of a Float64). Question: is such a feature actually going to be needed? Not so far (see below). On the other hand, (3) might be circumvented relatively easily by using a smarter algorithm than the one available now (see here for more details) to create the ad-hoc MPIPETScDistributedIndexSet. Note, though, that this smarter algorithm requires additional communication.
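To see where the sizeof restriction comes from, here is a minimal plain-Julia sketch (a toy illustration, not the actual implementation): the local data of eltype T is essentially viewed as Float64 so that it can live inside the ghosted PETSc.Vec{Float64}, and this only round-trips when the byte sizes match.

```julia
# Toy sketch (not the actual implementation): storing Int64 payloads inside a
# Float64-backed ghosted vector only works because both types are 8 bytes wide.
local_data = Int64[10, 20, 30]                    # e.g., global identifiers to exchange
@assert sizeof(eltype(local_data)) == sizeof(Float64)
as_float64 = reinterpret(Float64, local_data)     # zero-copy view with Float64 layout
recovered  = reinterpret(Int64, as_float64)       # exact round trip after communication
@assert recovered == local_data
```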

We note that we can currently cope with the above three limitations for two main reasons:

  1. The algorithm that currently generates the global numbering of DoFs only requires exchanging vectors of type DistributedVector{Vector{Int64}} among nearest neighbours (see here, here, and here), and the FE spaces tested so far are such that we have the same number of DoFs per cell (see the toy sketch after this list).
  2. The communication required to assemble the coefficient matrix and the RHS vector of the linear system is, at present, entirely managed by PETSc (here and here, respectively).
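To make point 1 concrete, a toy plain-Julia illustration (hypothetical data, not the actual code) of why equal-length entries keep the exchanged DistributedVector{Vector{Int64}} compatible with a fixed-stride buffer:

```julia
# Toy data: one Vector{Int64} of global DoF identifiers per cell. With the same
# number of DoFs per cell, all entries have equal length, so the data flattens
# into a fixed-stride buffer that fits the ghosted PETSc vector layout.
ndofs_per_cell = 3
cell_to_gids = [Int64[1, 2, 3], Int64[2, 3, 4], Int64[4, 5, 6]]
@assert all(length(gids) == ndofs_per_cell for gids in cell_to_gids)
flat = reduce(vcat, cell_to_gids)      # length == ndofs_per_cell * number of cells
```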

On the other hand, the current implementation is such that:

  1. [UPDATE: solved in e98d5f4] The entries of the MPIPETScDistributedVector are actually stored twice, namely in a Julia array and in a PETSc.Vec data structure; see here and here, respectively.

  2. The set-up of an MPIPETScDistributedIndexSet requires communication, in particular here.

(1) is caused by the fact that the local numbering of entries (e.g., cells or DoFs) in GridapDistributed.jl does not match the one hard-coded in PETSc ghosted vectors, i.e., owned entries first, then ghost entries. There are two main performance implications of (1). First, prior to communication, we have to copy the local entries from the Julia array into the PETSc.Vec array (see here), and perform the reverse operation after communication (see here). Second, right after solving a linear system with PETSc, we have to communicate (in order to get the values associated with ghost DoFs; see here), and then copy the local entries from the PETSc.Vec array to the Julia array (see here).

(2) is required because the global numbering of entries (e.g., cells or DoFs) generated by GridapDistributed.jl is not such that the entries owned by the first processor are numbered first, followed by those of the second processor, and so on, as required by PETSc. As a consequence, one has to manage, in PETSc parlance, both a global application (APP) ordering and a PETSc ordering, and transform global identifiers in the APP ordering into PETSc global identifiers (this is where the communication in (2) stems from). Question: is this something we can reasonably afford, or should we look for a more efficient solution instead? (Options: force a local/global ordering, use PETSc.VecScatter{T} directly instead of ghosted vectors, etc.)
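For concreteness, a minimal plain-Julia sketch of the copy step forced by the ordering mismatch (the permutation name and data are hypothetical; the actual code paths are the ones linked above):

```julia
# Sketch only: values kept in GridapDistributed's local order must be permuted into
# the owned-first/ghost-last layout expected by the ghosted PETSc vector before
# each communication, and permuted back afterwards.
function copy_to_petsc_order!(petsc_local::Vector{Float64},
                              julia_local::Vector{Float64},
                              local_to_petsc::Vector{Int})  # hypothetical precomputed permutation
  for (i, v) in enumerate(julia_local)
    petsc_local[local_to_petsc[i]] = v
  end
  return petsc_local
end

julia_local    = [1.0, 2.0, 3.0, 4.0]
local_to_petsc = [3, 1, 4, 2]            # toy permutation into owned-first layout
petsc_local    = similar(julia_local)
copy_to_petsc_order!(petsc_local, julia_local, local_to_petsc)
```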

Parallel Assembly

The current parallel assembly process is implemented in the assembly_matrix_vector method (available here) dispatched for the DistributedAssembler <: Assembler data type, and in the specialization of the methods allocate_coo_vectors, finalize_coo!, and sparse_from_coo for the PETSc matrix data type PETSc.Mat{Float64}, available here. In the current implementation of the parallel assembly there are both performance pitfalls and temporary workarounds that I had to apply in order to fit the current algorithm into the software design of Gridap.jl/GridapDistributed.jl. While I am positive that some of these can be solved with additional work, I believe that others are caused by having to stick to the current way the assembly_matrix_vector method drives the assembly process. In the sequel, I will explicitly mention when I believe the corresponding issue is caused by this.
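For reference, in the serial Gridap.jl path the role of the COO triplets essentially boils down to something like the following (toy data; the distributed specialization instead targets PETSc.Mat{Float64}):

```julia
using SparseArrays

# Toy COO triplets as produced by the assembly loop; repeated (i, j) pairs are
# summed by sparse, which is exactly the accumulation needed in FE assembly.
I = [1, 2, 2, 3, 1]
J = [1, 1, 2, 3, 1]
V = [4.0, -1.0, 4.0, 4.0, 1.0]
A = sparse(I, J, V, 3, 3)   # the serial sparse_from_coo reduces essentially to this
```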

Let us enumerate and elaborate a little bit more on them:

  1. [UPDATE: Solved in 1aae8ae] The current algorithm does not compute, before the actual numerical assembly, the (full) sparsity pattern of the rows locally owned by each processor. Instead, we preallocate some storage for the matrix (with a rough upper estimate of the number of nonzero elements per row), and entries are dynamically introduced into the matrix; see here and here. Our experience with FEMPAR reveals that this has a number of practical performance and memory implications that we want to avoid in the final version of the algorithm (see also issue Symbolic algorithms sparsity pattern #9).

  2. [UPDATE: Solved in 1aae8ae] As a complementary side note to 1., let me recall (so as not to forget, and to reuse) that: (1) we discussed here a possible way to accommodate the final/performant algorithm so that it sticks to the current structure of assembly_matrix_vector; (2) I already wrote an implementation of this algorithm, but it does not stick to the current structure of assembly_matrix_vector (see here for more details); (3) in the most general case (non-conforming meshes, hanging DoF constraints), the final/performant algorithm requires an ad-hoc communication among nearest neighbours to set up the sparsity pattern of the locally owned rows (see here for more details). At first glance, I do not see how this can be implemented given the aforementioned limitations of MPIPETScDistributedVector.

  3. [UPDATE: Solved in 25457d3] At present, the raw matrix contributions stored in the I,J,V COO vectors are added into the global matrix on a one-by-one basis; see here. This is not acceptable from a performance point of view when dealing with PETSc matrices. To my knowledge, there is no function in the PETSc API that lets one inject all entries of the I,J,V arrays in one shot. The highest-granularity function in the PETSc API lets one add a 2D dense matrix (e.g., a cell matrix) into the global matrix in one shot, so it would be more useful if we could either delimit the contribution of each cell within the I,J,V arrays, or avoid the COO arrays during the actual numerical assembly and inject the whole cell matrix (see the sketch after this list).

  4. [UPDATE: Solved in 25457d3] My major objections/concerns with respect to the current structure of assembly_matrix_vector are related to the way the assembly of the RHS is handled, which is somewhat borrowed from SparseMatrixAssembler and its data type extensions in Gridap.jl. In particular, this function relies on the fill_matrix_and_vector_coo_numeric!(I,J,V,b,assem,data) method, with b being the global (but distributed) vector data structure, and assem the local assembler corresponding to each MPI task, with a type (extending) SparseMatrixAssembler. There are two issues with this method. First, the contributions of the entries of the cell vector to those of the global vector are added on a one-by-one basis (see here), which is not acceptable from a performance point of view when dealing with PETSc vectors. Second, the code is written such that each global entry is read before it is updated, and PETSc vectors do not allow reading an entry that is not locally owned when using global numbering. To work around this issue, I currently have a temporary hack that consists of re-defining _assemble_matrix_and_vector_fill! in GridapDistributed.jl (see here). This hack is not acceptable.

  5. [UPDATE: Option discarded during the meeting in favor of the one that subassembles local portions of the RHS vector.] The solution that I think best balances all constraints to address 4. is to gather all cell contributions to the global vector into a pair of arrays, say (Ib, b), holding the subassembled raw contributions, and then assemble them at once in a single method, say vector_from_raw_entries. This has two inherent benefits: (1) there is no need to expose the "finalization" of the global vector in assembly_matrix_vector (see here for more details); (2) see here, "Other issues to be discussed".

  6. (Not really an issue, but something to discuss.) I had to parameterize the type of the vals member variable of DistributedFEFunction (see here). This makes it possible to write a specific variant of solve! for the particular FE functions at hand; see here for more details. Is this the way to go?
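Regarding item 3, the idea of injecting whole cell blocks (rather than single entries) can be sketched as follows in plain Julia. The add_cell_block! helper is hypothetical and a dictionary stands in for the global matrix; in PETSc this would map to a single MatSetValues call with ADD_VALUES per cell.

```julia
# Hypothetical sketch of per-cell block injection: if each cell's rows/cols/values
# are delimited, the whole cell matrix can be added in one call instead of entry by entry.
function add_cell_block!(A::Dict{Tuple{Int,Int},Float64},
                         rows::Vector{Int}, cols::Vector{Int}, vals::Matrix{Float64})
  for (lj, j) in enumerate(cols), (li, i) in enumerate(rows)
    A[(i, j)] = get(A, (i, j), 0.0) + vals[li, lj]   # accumulation, as in FE assembly
  end
  return A
end

A = Dict{Tuple{Int,Int},Float64}()        # toy stand-in for the global (PETSc) matrix
cell_rows = [1, 2]
cell_cols = [1, 2]
cell_mat  = [4.0 -1.0; -1.0 4.0]          # toy 2x2 cell matrix
add_cell_block!(A, cell_rows, cell_cols, cell_mat)
```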
