10+ million DoFs on cluster? PETSc? conda-forge? #1079
Replies: 3 comments 26 replies
-
I don't have experience in running PETSc, but it sounds like a reasonable path because it has been used to solve large problems. Some questions I want to ask:

I think the general folklore for iterative methods is that if you have a symmetric positive definite matrix, you should try the conjugate gradient method with an incomplete Cholesky preconditioner.
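In SciPy terms that would be something like the sketch below; SciPy itself has no incomplete Cholesky, so `spilu` stands in as a rough substitute, and `A` and `b` are assumed to be the assembled matrix and load vector:

```python
# Rough substitute for "CG + incomplete Cholesky" with what SciPy ships:
# an incomplete LU from spilu used as the preconditioner for cg.
# A (sparse, symmetric positive definite) and b are assumed already assembled.
import scipy.sparse.linalg as spla

ilu = spla.spilu(A.tocsc(), drop_tol=1e-4, fill_factor=10)  # incomplete factorization
M = spla.LinearOperator(A.shape, ilu.solve)                 # wrap it as a preconditioner
x, info = spla.cg(A, b, M=M)                                # info == 0 means converged
```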
-
G'day. It's a couple of years since I've worked on problems like this, but let's see. I don't think I got very far with PETSc; that was a long-term goal that got interrupted. Library aside, my understanding is that the best way to solve problems like this is multigrid with a block-diagonal preconditioner. We've got an example for the Stokes system, which is structurally pretty similar (maybe missing the lower-right D matrix, but that can be handled). The easiest multigrid solvers to invoke are pyamg and pyamgcl; we've got examples of these. The kind of preconditioner required is easy enough to build by hand in skfem. Last time I looked, pyamg hadn't been parallelized. There are parallel multigrid solvers out there, e.g. in PETSc (HYPRE BoomerAMG), but I'm not sure how much work it would be to interface to them. The first thing to do, I think, is to try pyamg and the block-diagonal preconditioner as in the Stokes example and see how it goes on your problem on your hardware.
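For what it's worth, the kind of thing I have in mind is roughly the sketch below, with pyamg supplying a V-cycle on the leading block; the names `A`, `Q`, `K`, and `rhs` are placeholders for the blocks and full system as assembled with skfem:

```python
# Indicative sketch: A is the leading block, Q a mass matrix for the trailing
# block, K the assembled block system and rhs its right-hand side (all SciPy
# sparse matrices/arrays coming out of skfem).
import numpy as np
import scipy.sparse.linalg as spla
import pyamg

def block_diag_precond(A, Q):
    """Apply diag(AMG(A)^-1, diag(Q)^-1) as a LinearOperator."""
    amg = pyamg.smoothed_aggregation_solver(A.tocsr()).aspreconditioner(cycle='V')
    Qdiag = Q.diagonal()
    n, m = A.shape[0], Q.shape[0]

    def apply(x):
        # AMG V-cycle on the leading block, diagonal scaling on the trailing block
        return np.concatenate([amg.matvec(x[:n]), x[n:] / Qdiag])

    return spla.LinearOperator((n + m, n + m), matvec=apply)

M = block_diag_precond(A, Q)
x, info = spla.minres(K, rhs, M=M)   # minres for the symmetric case; gmres otherwise
```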
-
I'm trying to solve the assembly below (complex) on a mesh with 1-20 million elements (or more? more is better!), using P2 elements. Assembling the system matrix is usually not that bad, but I am having enormous trouble with the solve step.

I'm running on either an Intel i9 16-core desktop with 64 GB memory (my computer) or an HPE Cray EX cluster with 128 cores/node, 238 GB memory/node, and up to 16 nodes (the preferred environment).

`scipy` seems unable to leverage the cluster resources, so I've tried bringing the matrix over to PETSc (@gdmcbain I saw you tried this a couple of years back). I've tried a direct solve and GMRES, and I've tried to get the Hypre ILU preconditioner working (without success). But I remain stuck at around 100k DoFs before the compute time becomes unmanageable. I'm pretty novice at clusters, MPI, and PETSc, so I'm not sure where the bottleneck is. With some things I try it looks like the actual preconditioning/solve step is the problem, but with others it looks like some kind of MPI comms issue, and I really don't know how to profile what is happening during execution; I use a lot of `t1 = time.time()` and `print(t2 - t1)` :(
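For concreteness, the petsc4py solve path I've been attempting looks roughly like the sketch below, where `A`, `b`, and `x` are the already-assembled PETSc matrix and vectors and the solver/preconditioner choices are just my current experiment:

```python
# Rough sketch of my current attempt (A, b, x are the assembled petsc4py
# AIJ matrix and vectors; solver/PC choices are just what I'm experimenting with).
from petsc4py import PETSc

ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
ksp.setOperators(A)
ksp.setType('gmres')
ksp.getPC().setType('hypre')   # selecting Hypre's ILU (-pc_hypre_type euclid, I think) is where I'm stuck
ksp.setTolerances(rtol=1e-8)
ksp.setFromOptions()           # picks up -ksp_monitor, -ksp_view etc. from the command line
ksp.solve(b, x)                # running under mpiexec with -log_view prints a timing breakdown at exit
print(ksp.getConvergedReason(), ksp.getIterationNumber())
```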
I also need to solve this system 100-200 times for different `sigma` (it is frequency dependent), so compute time is a big deal. Taking, for example, an hour to solve one frequency is no good. One strategy that has sort of worked is to let each MPI process solve a different frequency, but that quickly runs out of memory, also in the 100k DoFs range.

Here is the assembly I'm working with, with `alpha`, `v`, `i`, `sigma`, and `zsh` complex. `sigma`, `i`, and `zsh` are known and spatially homogeneous. `alpha` and `v` are unknown. `sigma` is a 3x3 tensor. (I can do useful things with `sigma` restricted to a diagonal matrix, if that helps.)

And the code I assemble it with (here `sigma0` is the 3x3 tensor):
The codes work for small systems (10k DoFs is easy on my desktop with scipy and a direct solve). Any advice on how to scale this up and speed up the solve of `Mx = b` for millions of DoFs? Or how to profile it? (And what to do with that timing info!)
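For reference, the one-frequency-per-rank fallback I mentioned is essentially the pattern below, with a hypothetical `assemble_system(freq)` standing in for the real assembly; every rank ends up holding its own full factorization, which is why the memory runs out:

```python
# Hypothetical sketch of the "one frequency per MPI rank" fallback;
# assemble_system(freq) stands in for the real assembly and returns the
# complex sparse matrix M and load vector b for that frequency.
import numpy as np
import scipy.sparse.linalg as spla
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

freqs = np.logspace(0, 3, 150)                    # 100-200 frequencies
solutions = {}
for k in range(rank, len(freqs), size):           # round-robin over ranks
    M, b = assemble_system(freqs[k])
    solutions[k] = spla.splu(M.tocsc()).solve(b)  # full LU per rank: memory-hungry

gathered = comm.gather(solutions, root=0)         # collect everything on rank 0
```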