10+ million DoFs on cluster? PETSc? conda-forge? #1079

gatling-nrl · 2023-12-16T21:40:02Z

gatling-nrl
Dec 16, 2023

I'm trying to solve the below (complex) assembly on a mesh with 1-20 million elements (or more? more is better!), using P2 elements. Assembling the system matrix is usually not that bad. But I am having enormous trouble with the solve step.

I'm running on either a intel-i9 16 core desktop with 64 GB memory (my computer) or an HPE Cray EX cluster with 128 cores/node and 238 GB memory/node and up to 16 nodes (preferred environment). scipy seems unable to leverage the cluster resources, so I've tried bringing the matrix over to PETSc (@gdmcbain i saw you tried this a couple years back). I've tried direct solve and GMRES and I've tried to get the Hypre ILU preconditioner working (without success). But I remain stuck at around 100k DoFs before the compute time becomes unmanageable. I'm pretty novice at clusters, MPI, and PETSc, so I'm not sure where the bottleneck is. Some things I try, it seems like it is the actual precondition/solve step, but other things I try seem like it is some kind of MPI comms thing. But I really don't know how to profile what is happening during execution. I use a lot of t1 = time.time() and print(t2-t1) :(

I also need to solve this system 100-200 times for different sigma (it is frequency dependent), so compute time is a big deal. Taking, for example, an hour to solve one frequency is no good. One strategy that has sort of worked is to let each MPI process solve a different frequency, but that quickly runs out of memory, also in the 100k DoFs range.

Here is the assembly I'm working with, with alpha, v, i, sigma, and zsh complex. Sigma, i, and zsh are known and spatially homogenous. alpha and v are unknown. sigma is a 3x3 tensor. (I can do useful things with sigma restricted to a diagonal matrix, if that helps.)

and the code I assemble it with (here sigma0 is the 3x3 tensor):

        shape = basis.global_coordinates().shape[1:]
        ones = np.ones(shape)    
        sigma = np.array([
            [sigma0[0,0]*ones, sigma0[0,1]*ones, sigma0[0,2]*ones],
            [sigma0[1,0]*ones, sigma0[1,1]*ones, sigma0[1,2]*ones],
            [sigma0[2,0]*ones, sigma0[2,1]*ones, sigma0[2,2]*ones],
        ])
        # Assemble
        sigma_r = skfem.DiscreteField(np.real(sigma))
        sigma_i = skfem.DiscreteField(np.imag(sigma))
        Ar = bilinear.assemble(basis, coef=sigma_r)
        Ai = bilinear.assemble(basis, coef=sigma_i)
        A = Ar + 1j*Ai
        B = np.sum([
            1/zsh * bilinear2.assemble(e)
            for e in e_basis
        ], axis=0)
        C = np.column_stack([
            -1/zsh * linear.assemble(e) for e in e_basis
        ])
        D = np.diag([area/zsh for area in e_areas])
        M = scipy.sparse.bmat([
            [A + B, C],
            [C.T,   D],
        ], format="csc")
        b = np.hstack([
            np.zeros(basis.N, dtype=complex),
            ns.currents,
        ])
        # Condense the perfect conductors
        D = wall_dofs
        I = np.setdiff1d(np.arange(basis.N + L), D)
        M = M[I][:,I]
        b = b[I]

The codes work, for small systems (10k dofs is easy on my desktop with scipy and direct solve). Any advice on how to scale this up and speed up the solve of Mx=b for millions of dofs? Or how to profile it? (And what to do with that timing info!)

kinnala · 2023-12-16T21:55:49Z

kinnala
Dec 16, 2023
Maintainer

I don't have experience in running PETSc but it sounds like a reasonable path because it has been used to solve large problems. Some questions I want to ask:

Is it triangular or tetrahedral mesh?
Is the system matrix symmetric?
Is the system matrix positive definite?

I think the general folklore for iterative methods is that if you have symmetric positive definite matrix you should try conjugate gradient method with incomplete Cholesky preconditioner.

4 replies

gatling-nrl Dec 16, 2023
Author

It is tetrahedral. Symmetry depends on sigma. In general it is not symmetric. I'm not sure about positive definite. I think it is, but either way I also know in some cases it has a poor condition number.

Since it is not generally symmetric, I have been trying to get GMRES and ILU to work. (And scipy does it no problem for small numbers of DoFs, but getting it to scale up is the problem.)

I posted this question in the PETSc gitlab as well (with some edits to remove the stuff about assembly) here, since it is probably more of a PETSc question than skfem. I'll add the answers to your questions there as well.

gdmcbain Dec 16, 2023

.> I think the general folklore for iterative methods is that if you have symmetric positive definite matrix you should try conjugate gradient method with incomplete Cholesky preconditioner.

I think this is for monolithic problems, like Poisson; for mixed problems, I believe the folklore is to exploit the saddle-point block first.

gdmcbain Dec 16, 2023

Symmetry depends on sigma. In general it is not symmetric.

Ouch. This is going to hurt. The Stokes example is symmetric. The same idea of block diagonal (or triangular) preconditioning is used in asymmetric systems (mutatis mutandis) but the multigrid slows down. Or has to be specialized for the asymmetry which is nontrivial.

gdmcbain Dec 16, 2023

if you have symmetric positive definite matrix you should try conjugate gradient method

Further to that (and further in a direction which is off topic, sorry), I have heard arguments that even for symmetric positive definite systems GMRES is better because it's basically just as good in good cases while at the same time handling much better little adventitious asymmetries sneaking in via round-off.

gdmcbain · 2023-12-16T22:08:25Z

gdmcbain
Dec 16, 2023

G'day. It's a couple of years since I've worked on problems like this but let's see. I don't think I got very far with PETSc; that was a long-term goal that has been interrupted, but library aside, my understanding is that the best way to solve problems like this is multigrid with a block diagonal preconditioner. We've got an example for the Stokes system, which is structurally pretty similar (maybe missing the lower-right D matrix, but that can be handled).

Example 32: Block diagonally preconditioned Stokes solver

The easiest multigrid solvers to invoke are pyamg and pyamgcl; we've got examples of these . The kind of preconditioner required is easy enough to build by hand in skfem.

Last time I looked, pyamg hadn't been parallelized. There are parallel multigrid solvers out there, in PETSc (HYPRE BoomerAMG), but I'm not sure how much work it will be to interface to them. The first thing to do, I think, is try pyamg and the block diagonal preconditioner as in the Stokes example and see how it goes on your problem on your hardware.

0 replies

kinnala · 2023-12-16T22:40:35Z

kinnala
Dec 16, 2023
Maintainer

Perhaps this is not directly related to your problem, but I just tried petsc4py for fun and was able to solve 3D Poisson problem with P1 elements and 1,7 million DOFs in 1 minute 34 seconds on my Apple laptop. I did not expect it to run so quickly but the result looks pretty OK:

Maybe this code will give you some hints how this petsc4py can be used. Assembly is not here parallel but if it's not the bottleneck I think you can then do it on one node and let PETSc move stuff around?

import sys
import petsc4py
import numpy as np
from datetime import datetime
from petsc4py import PETSc

petsc4py.init(sys.argv)

OptDB = PETSc.Options()

A = PETSc.Mat()
A.create(comm=PETSc.COMM_WORLD)

from skfem import *
import skfem

m = MeshTet().refined(7)  # perhaps make this smaller if you want to test

basis = Basis(m, ElementTetP1())

from skfem.models import laplace, unit_load

B = laplace.assemble(basis)
f = unit_load.assemble(basis)

start = datetime.now()

B, f = enforce(B, f, D=basis.get_dofs())
Br, fr, p = skfem.utils.rcm(B, f)  # reorder rows and cols to reduce fill in

# create petsc matrix directly from CSR data
A.createAIJWithArrays((basis.N, basis.N), (Br.indptr, Br.indices, Br.data))


ksp = PETSc.KSP()
ksp.create(comm=A.getComm())
ksp.setType(PETSc.KSP.Type.CG)
ksp.getPC().setType(PETSc.PC.Type.GAMG)

ksp.setOperators(A)

x, b = A.createVecs()
b.setArray(fr)
ksp.solve(b, x)

y1 = np.zeros(basis.N)
y1[p] = x.array

print(datetime.now() - start)

# uncomment to compare against direct solve (do not try with refined(6) or larger)
# y2 = solve(B, f)
# print(np.max(np.abs(y1 - y2)))

22 replies

gdmcbain Dec 19, 2023

Meanwhile there is an awesome discussion of these points through the block-preconditioning/Schur complement/field-split lens over on the PETSc issue; see https://gitlab.com/petsc/petsc/-/issues/1512#note_1699627307 and following. I think I'll wait to see what the PETSc experts over there come up with; it could be very valuable here. The basic idea does seem to be to respect the blocks, to treat the conditions/constraints differently to the partial differential field-operators.

gatling-nrl Dec 20, 2023
Author

I too am focused on the petsc schur solution right now instead of working with the boundary conditions (although I definitely appreciate the offer!)

So far I have gotten 6 million DoFs to solve in a handful of minutes (the skfem assembly did take a while for that) which is so far the biggest problem I have managed.

I don't think in 10 years I could have googled / rtfm'ed my way to getting petsc working without their help. That package is complicated. Of course, I thought the same thing about skfem for a while too...

Here's the working code as of right now (that is a long, evolving thread)...

https://gitlab.com/petsc/petsc/-/issues/1512#note_1701654021

gatling-nrl Dec 20, 2023
Author

Is there any thoughts on (optionally) assembling directly to PETSc data structures? What about making skfem (optionally) parallel? OpenMP? MPI? Dask?

Of course, there is some huge appeal to numpy/scipy and serial, because PETSc and parallel does not help make things easier to understand...

And also, these things make more dependencies, and not easy ones to get working everywhere. Another huge advantage of numpy/scipy/matplotlib, the trinity of science in python.

gdmcbain Dec 20, 2023

There was talk of this back in #236, #713, …. It's always made a lot of sense, as an option. If I recall correctly, parallel assembly was the default until #317; from memory I found it made debugging harder so lobbied against it back then. Yeah, it'd be nice to be able to work with skfem as is in pure scipy then should one wish to scale simply switch out the assembly functions, in the same way that we switch out solvers in ex9.

kinnala Jun 7, 2024
Maintainer

Is there any thoughts on (optionally) assembling directly to PETSc data structures? What about making skfem (optionally) parallel? OpenMP? MPI? Dask?

Assembling directly to PETSc data structures should be a simple addition, but I think that instead of import petsc4py, we should make it more convenient, e.g., the output of assemble could be optionally compatible with the PETSc functions such as createAIJWithArrays.

Making skfem parallel has many aspects to it, depending on what is meant by it. Since we talk in the context of PETSc, I assume this has to do with assembling a distributed PETSc matrix in a distributed fashion after a domain decomposition has been performed for the mesh, one subdomain at a time in different nodes.

It would require lots of internal changes to make such thing as convenient and smooth as possible. I think we should instead make it possible to easily use the existing functionality to build such parallel solvers - this library is a "build-your-own-solver -kit" after all. For example, we could have

Mesh.decompose(...) to run PyMETIS and save the different mesh pieces to files, including a reverse mapping from local to global nodes
We already have the keyword argument to in asm(bilinf, basis, to=...) which is not much used, but could be provided something like fmt_petsc which would be a new function in skfem.utils for turning COO sparse array into a PETSc matrix
If assembly is done locally, one mesh piece at a time, then something would have to be done with remapping the matrix indices from the local (single mesh piece) to the global (large) mesh.

It should be then quite easy to create a solver with PETSc that is assembling and solving everything in a distributed fashion.

As a side note: We recently solved 3D 86 M DOF Poisson problem using scikit-fem, no PETSc but a custom algorithm for domain decomposition using model order reduction, no special assumptions on the shape of the mesh: https://arxiv.org/pdf/2404.06260

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

10+ million DoFs on cluster? PETSc? conda-forge? #1079

{{title}}

Replies: 3 comments 26 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

10+ million DoFs on cluster? PETSc? conda-forge? #1079

gatling-nrl Dec 16, 2023

Replies: 3 comments · 26 replies

kinnala Dec 16, 2023 Maintainer

gatling-nrl Dec 16, 2023 Author

gdmcbain Dec 16, 2023

gdmcbain Dec 16, 2023

gdmcbain Dec 16, 2023

gdmcbain Dec 16, 2023

kinnala Dec 16, 2023 Maintainer

gdmcbain Dec 19, 2023

gatling-nrl Dec 20, 2023 Author

gatling-nrl Dec 20, 2023 Author

gdmcbain Dec 20, 2023

kinnala Jun 7, 2024 Maintainer

gatling-nrl
Dec 16, 2023

Replies: 3 comments 26 replies

kinnala
Dec 16, 2023
Maintainer

gatling-nrl Dec 16, 2023
Author

gdmcbain
Dec 16, 2023

kinnala
Dec 16, 2023
Maintainer

gatling-nrl Dec 20, 2023
Author

gatling-nrl Dec 20, 2023
Author

kinnala Jun 7, 2024
Maintainer