-
Hi, this is not an issue, just a question. How long does it take to assemble a grad*grad (i.e. the weak formulation of the Laplace equation) on 1 million quads with order 1 interpolation (and orders 2 and 3 as well, if possible) on a structured quad mesh of a square? No boundary conditions; just the assembly time is needed. The integration order should be exact for order 2 (so that's 4 Gauss points, if I am not mistaken). Thanks for your help!
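For reference, by grad*grad I mean the standard weak form of the Laplace equation,

$$a(u, v) = \int_\Omega \nabla u \cdot \nabla v \,\mathrm{d}x,$$

assembled over the one million elements.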
Replies: 19 comments
-
There is something in the README using triangles. For quads I have nothing premade, but I can try it out when I get back from the holidays. For quads there is the need to use a more complicated reference mapping if you want to be general w.r.t. the shape of the quad. That will certainly sacrifice some cycles, which could be avoided by using a less general mapping which just moves and scales the reference quad. We also don't have very good support for parallel assembly at the moment and, thus, those numbers in the README are for single-threaded assembly.
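Concretely (standard FEM background, not from the thread): the general map from the reference quad carries a $\xi\eta$ cross term,

$$F(\xi, \eta) = \mathbf{a} + \mathbf{b}\,\xi + \mathbf{c}\,\eta + \mathbf{d}\,\xi\eta,$$

so its Jacobian varies inside each element and has to be evaluated at every quadrature point, whereas a map that only moves and scales the reference quad drops the cross term and has a single constant Jacobian per element.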
-
OK, thank you. Yes, assembly for general quads is good.
-
Just to give you a very quick preliminary idea, I've put together something rudimentary: ex_670_speed_benchmark.py. On my System76 laptop running Pop!_OS 21.04 and just timing with `time python docs/examples/ex_670_speed_benchmark.py` (hence including meshing and building the basis), I get about 5.3 seconds.
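For readers without the file at hand, a minimal sketch of this kind of benchmark in scikit-fem (the actual ex_670_speed_benchmark.py may differ in its details):

```python
import numpy as np
import skfem as fe
from skfem.models.poisson import laplace  # predefined grad(u) . grad(v) form

# 1000 x 1000 structured mesh of the unit square: one million quads
mesh = fe.MeshQuad1.init_tensor(*[np.linspace(0, 1, 1001)] * 2)
basis = fe.Basis(mesh, fe.ElementQuad1(), intorder=2)  # 4 Gauss points per quad
A = laplace.assemble(basis)  # sparse stiffness matrix, no boundary conditions
```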
-
Thanks! What do you mean by affine on a quad? (I can understand affine for a triangle, though.)
-
I've actually never checked whether MappingAffine works for a quad mesh, as is done above. It is not used by default, and the behaviour may be undefined since it's not tested at all.
-
That seems more than 30% faster than the best I could ever obtain in my former vectorized MATLAB code some time ago! It is, however, a factor slower than the 0.95 s I can get for 1 million (classical 4-dof "isoparametric") quads on my laptop with C++ in sparselizard.
-
Based on my observations, though, this part is rarely the bottleneck. If you have a nonlinear problem with a really complex form, then assembly can be an issue, because the Jacobian is different at each step and a reassembly is required.
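To illustrate the point, here is a minimal sketch (mine, not from the thread; the PDE, the form names, and the kwarg `uh` are all my choices) of a Newton iteration for a nonlinear Poisson problem, $-\nabla \cdot ((1 + u^2)\nabla u) = 1$ with zero Dirichlet conditions. Because the Jacobian depends on the current iterate, both forms are reassembled at every step. It assumes a recent scikit-fem API:

```python
import skfem as fe
from skfem.helpers import dot, grad

mesh = fe.MeshQuad1().refined(5)
basis = fe.Basis(mesh, fe.ElementQuad1())

@fe.BilinearForm
def jacobian(u, v, w):
    # linearization of (1 + u^2) grad(u) around the current iterate w['uh']
    return ((1 + w['uh'] ** 2) * dot(grad(u), grad(v))
            + 2 * w['uh'] * u * dot(grad(w['uh']), grad(v)))

@fe.LinearForm
def residual(v, w):
    return (1 + w['uh'] ** 2) * dot(grad(w['uh']), grad(v)) - 1.0 * v

x = basis.zeros()
D = basis.get_dofs()
for itr in range(8):
    uh = basis.interpolate(x)
    J = jacobian.assemble(basis, uh=uh)  # must be reassembled every step
    r = residual.assemble(basis, uh=uh)
    x += fe.solve(*fe.condense(J, -r, D=D))
```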
-
Is this single-core assembly?
-
That's absolutely true for low-order FEM, but it actually becomes the dominant cost for higher-order shape functions (>= order 3), where the assembly time can easily exceed the matrix solve time if not optimized (for static, steady-state simulations).
-
No, it's using the 4 cores of a Core i7 CPU.
-
This is a good point. In scikit-fem we do not have a well-optimized assembly routine for the case where you have lots of quadrature points per element; everything is optimized around the case of lots of elements and a relatively low integration order. I have some ideas on how to make it better that I hope to test in the future.
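As a concrete illustration of that design (my sketch, not from the thread): inside a form, the quadrature data arrives as arrays of shape `(nelements, npoints)`, with gradient components stacked in front, so NumPy vectorization pays off when there are many elements and few points, and less so the other way around:

```python
import numpy as np
import skfem as fe

basis = fe.Basis(fe.MeshQuad1().refined(3), fe.ElementQuad1(), intorder=2)

@fe.BilinearForm
def probe(u, v, w):
    # u.grad has shape (2, nelements, npoints): one big vectorized batch
    print(u.grad.shape)
    return np.einsum('i...,i...->...', u.grad, v.grad)

probe.assemble(basis)
```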
-
I believe the above timing (5.3 s) uses only one core. We are not automatically doing any sort of multithreading, but I could look into parallelizing this example at some point.
-
Sounds good. For any future benchmark at higher orders you might want to do, here are the timings for order 2 and 3 quads (i.e. 9- and 16-dof quads; the Gauss quadrature order is increased accordingly): 4-dof quad --> 0.95 sec on a quad-core Core i7 from 2019 with enough RAM. It includes everything from meshing (structured mesh, so negligible time) to having the matrix A fully ready to use. It is not optimized specifically for the grad*grad/structured-mesh case, and it does not take advantage of any matrix symmetry.
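For reproducing a comparable scikit-fem measurement at higher orders, a minimal sketch (ElementQuad2 being the 9-dof biquadratic quad; a 16-dof cubic quad element is omitted here since a matching element may not be available; timings will of course differ by machine):

```python
import timeit

import numpy as np
import skfem as fe
from skfem.models.poisson import laplace

mesh = fe.MeshQuad1.init_tensor(*[np.linspace(0, 1, 1001)] * 2)
for elem, intorder in [(fe.ElementQuad1(), 2), (fe.ElementQuad2(), 4)]:
    basis = fe.Basis(mesh, elem, intorder=intorder)
    t0 = timeit.default_timer()
    laplace.assemble(basis)
    print(type(elem).__name__, timeit.default_timer() - t0)
```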
-
I did 4 threads and added some caching:

```python
import timeit

import numpy as np
import skfem as fe
from numba import jit


@jit(nogil=True, nopython=True)
def nlaplace(out, du, dv):
    # accumulate grad(u) . grad(v) over the gradient components
    for i in range(du.shape[1]):
        for j in range(du.shape[2]):
            for k in range(du.shape[0]):
                out[i, j] += du[k, i, j] * dv[k, i, j]


start = timeit.default_timer()

mesh = fe.MeshQuad1.init_tensor(*[np.linspace(0, 1, 1001)] * 2)
element = fe.ElementQuad1()
mapping = fe.MappingIsoparametric(mesh, element)
basis = fe.Basis(mesh, element, mapping, intorder=2)


@fe.BilinearForm(nthreads=4)
def laplace(u, v, w):
    # zeros_like (not empty_like) because nlaplace accumulates with +=
    out = np.zeros_like(u.grad[0])
    nlaplace(out, u.grad, v.grad)
    return out


lap = laplace.assemble(basis)
stop = timeit.default_timer()
print(stop - start)
```

and the following diffs:

It improved the timing from about 5.5 seconds to 3.6 seconds. Still some work to do, though. Thanks for the timings!
-
@halbux by "matrix A ready to be used" you mean in some sort of CSC/CSR data structure, right?
-
Exactly: in CSR format (turns out the conversion to CSR takes some time after all). It's not a standard BLAS function, so I had to write it myself. In MKL, though, there is a function for that, which I bet is highly optimized.
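For context, the conversion in question can be sketched with SciPy, which is the format scikit-fem returns its matrices in (sparselizard's CSR conversion is its own C++ implementation):

```python
import numpy as np
import scipy.sparse as sp

# (row, col, value) triplets as a typical FEM assembly loop emits them;
# duplicate (row, col) pairs are summed during the conversion
rows = np.array([0, 0, 1, 2, 2, 0])
cols = np.array([0, 2, 1, 0, 2, 0])
vals = np.array([4.0, -1.0, 3.0, -1.0, 2.0, 1.0])

A = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()
print(A.toarray())  # entry (0, 0) is 5.0: the two (0, 0) triplets were summed
```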
-
Yes! I found out about sparselizard some time ago already, but I'm surprised that you work in Finland as well. If you ever happen to visit Aalto University, you can find me in the main building, M-wing. I used to visit Tampere for project meetings and seminars some years ago, but right now there is nothing going on. Maybe we must arrange a Finnish Finite Element Fair or something. ;-)
-
I will gladly, but so far I have never been. :)
-
Oops, no, it doesn't!

```python
import numpy as np  # needed for np.printoptions below

import skfem as fe
from skfem.models.poisson import laplace

m = fe.MeshQuad()
elem = fe.ElementQuad1()
with np.printoptions(suppress=True, precision=3):
    for mapping in [fe.MappingAffine(m), fe.MappingIsoparametric(m, elem)]:
        print(laplace.assemble(fe.Basis(m, elem, mapping)).toarray())
```

gives two different matrices (output omitted). I had assumed that it would be O.K. on quadrilateral elements that are affinely similar to the reference square, i.e. parallelograms, and in particular rectangular constructions like docs/examples/ex19.py, lines 53 to 56 at commit 8248218. (And I am hoping that I have not assumed and used this in any work to date…!)
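For the record, the geometric condition behind that assumption (my derivation, standard FEM): with vertices $\mathbf{x}_1, \dots, \mathbf{x}_4$ ordered around the quad and $(\xi, \eta) \in [0, 1]^2$, the bilinear map is

$$F(\xi, \eta) = \mathbf{x}_1 + (\mathbf{x}_2 - \mathbf{x}_1)\,\xi + (\mathbf{x}_4 - \mathbf{x}_1)\,\eta + (\mathbf{x}_1 - \mathbf{x}_2 + \mathbf{x}_3 - \mathbf{x}_4)\,\xi\eta,$$

which is affine exactly when the cross-term coefficient vanishes, i.e. when the quad is a parallelogram; the experiment above suggests MappingAffine does not implement even that special case for quads.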