-
Hi, this is not an issue, just a question. How long does it take to assemble a grad*grad (i.e. the weak formulation of the Laplace equation) on 1 million quads with order 1 interpolation (and orders 2 and 3 as well, if possible) on a structured quad mesh of a square? No boundary conditions; just the assembly time is needed. The integration order should be exact for order 2 (so that's 4 Gauss points, if I am not mistaken). Thanks for your help!
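For reference, by grad*grad I mean the standard weak form of the Laplace equation,

$$a(u, v) = \int_\Omega \nabla u \cdot \nabla v \,\mathrm{d}x,$$

assembled over the one million elements.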
Replies: 19 comments
-
There is something in the README using triangles. For quads I have nothing premade, but I can try it out when I get back from the holidays. For quads there is the need to use a more complicated reference mapping if you want to be general w.r.t. the shape of the quad. That will certainly sacrifice some cycles, which could be avoided by using a less general mapping which just moves and scales the reference quad. We also don't have very good support for parallel assembly at the moment and, thus, those numbers in the README are for single-threaded assembly.
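Concretely (standard FEM background, not from the thread): the general map from the reference quad carries a $\xi\eta$ cross term,

$$F(\xi, \eta) = \mathbf{a} + \mathbf{b}\,\xi + \mathbf{c}\,\eta + \mathbf{d}\,\xi\eta,$$

so its Jacobian varies inside each element and has to be evaluated at every quadrature point, whereas a map that only moves and scales the reference quad drops the cross term and has a single constant Jacobian per element.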
-
OK, thank you. Yes, assembly for general quads is good.
-
Just to give you a very quick preliminary idea, I've put together something rudimentary: ex_670_speed_benchmark.py. On my System76 laptop running Pop!_OS 21.04 and just timing with `time python docs/examples/ex_670_speed_benchmark.py` (hence including meshing and building the basis), I get about 5.3 seconds.
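For readers without the file at hand, a minimal sketch of this kind of benchmark in scikit-fem (the actual ex_670_speed_benchmark.py may differ in its details):

```python
import numpy as np
import skfem as fe
from skfem.models.poisson import laplace  # predefined grad(u) . grad(v) form

# 1000 x 1000 structured mesh of the unit square: one million quads
mesh = fe.MeshQuad1.init_tensor(*[np.linspace(0, 1, 1001)] * 2)
basis = fe.Basis(mesh, fe.ElementQuad1(), intorder=2)  # 4 Gauss points per quad
A = laplace.assemble(basis)  # sparse stiffness matrix, no boundary conditions
```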
-
Thanks! What do you mean by affine on a quad? (I can understand affine for a triangle, though.)
-
I've actually never checked whether MappingAffine works for a quad mesh, as is done above. It is not used by default, and the behaviour may be undefined since it's not tested at all.
-
That seems more than 30% faster than the best I could ever obtain in my former vectorized MATLAB code some time ago! It is, however, a factor slower than the 0.95 s I can get for 1 million (classical 4-dof "isoparametric") quads on my laptop with C++ in sparselizard.
-
Based on my observations, though, this part is rarely the bottleneck. If you have a nonlinear problem with a really complex form, then assembly can be an issue, because the Jacobian is different at each step and a reassembly is required.
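To illustrate the point, here is a minimal sketch (mine, not from the thread; the PDE, the form names, and the kwarg `uh` are all my choices) of a Newton iteration for a nonlinear Poisson problem, $-\nabla \cdot ((1 + u^2)\nabla u) = 1$ with zero Dirichlet conditions. Because the Jacobian depends on the current iterate, both forms are reassembled at every step. It assumes a recent scikit-fem API:

```python
import skfem as fe
from skfem.helpers import dot, grad

mesh = fe.MeshQuad1().refined(5)
basis = fe.Basis(mesh, fe.ElementQuad1())

@fe.BilinearForm
def jacobian(u, v, w):
    # linearization of (1 + u^2) grad(u) around the current iterate w['uh']
    return ((1 + w['uh'] ** 2) * dot(grad(u), grad(v))
            + 2 * w['uh'] * u * dot(grad(w['uh']), grad(v)))

@fe.LinearForm
def residual(v, w):
    return (1 + w['uh'] ** 2) * dot(grad(w['uh']), grad(v)) - 1.0 * v

x = basis.zeros()
D = basis.get_dofs()
for itr in range(8):
    uh = basis.interpolate(x)
    J = jacobian.assemble(basis, uh=uh)  # must be reassembled every step
    r = residual.assemble(basis, uh=uh)
    x += fe.solve(*fe.condense(J, -r, D=D))
```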
-
Is this single-core assembly?
-
That's absolutely true for low-order FEM, but it actually becomes the dominant cost for higher-order shape functions (>= order 3), where the assembly time can easily exceed the matrix solve time if not optimized (for static, steady-state simulations).
-
No, it's using the 4 cores of a Core i7 CPU.
-
This is a good point. In scikit-fem we do not have a well-optimized assembly routine for the case where you have lots of quadrature points per element; everything is optimized around the case of lots of elements and a relatively low integration order. I have some ideas on how to make it better that I hope to test in the future.
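As a concrete illustration of that design (my sketch, not from the thread): inside a form, the quadrature data arrives as arrays of shape `(nelements, npoints)`, with gradient components stacked in front, so NumPy vectorization pays off when there are many elements and few points, and less so the other way around:

```python
import numpy as np
import skfem as fe

basis = fe.Basis(fe.MeshQuad1().refined(3), fe.ElementQuad1(), intorder=2)

@fe.BilinearForm
def probe(u, v, w):
    # u.grad has shape (2, nelements, npoints): one big vectorized batch
    print(u.grad.shape)
    return np.einsum('i...,i...->...', u.grad, v.grad)

probe.assemble(basis)
```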
-
I believe the above timing (5.3 s) uses only one core. We are not automatically doing any sort of multithreading, but I could look into parallelizing this example at some point.
-
Sounds good. For any future benchmark at higher orders you might want to do, here are the timings for order 2 and 3 quads (i.e. 9- and 16-dof quads; the Gauss quadrature order is increased accordingly): 4-dof quad --> 0.95 sec on a quad-core Core i7 from 2019 with enough RAM. It includes everything from meshing (structured mesh, so negligible time) to having the matrix A fully ready to use. It is not optimized specifically for the grad*grad/structured-mesh case, and it does not take advantage of any matrix symmetry.
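For reproducing a comparable scikit-fem measurement at higher orders, a minimal sketch (ElementQuad2 being the 9-dof biquadratic quad; a 16-dof cubic quad element is omitted here since a matching element may not be available; timings will of course differ by machine):

```python
import timeit

import numpy as np
import skfem as fe
from skfem.models.poisson import laplace

mesh = fe.MeshQuad1.init_tensor(*[np.linspace(0, 1, 1001)] * 2)
for elem, intorder in [(fe.ElementQuad1(), 2), (fe.ElementQuad2(), 4)]:
    basis = fe.Basis(mesh, elem, intorder=intorder)
    t0 = timeit.default_timer()
    laplace.assemble(basis)
    print(type(elem).__name__, timeit.default_timer() - t0)
```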
-
I did 4 threads and added some caching:

```python
import timeit

import numpy as np
import skfem as fe
from numba import jit


@jit(nogil=True, nopython=True)
def nlaplace(out, du, dv):
    # accumulate grad(u) . grad(v) over the gradient components
    for i in range(du.shape[1]):
        for j in range(du.shape[2]):
            for k in range(du.shape[0]):
                out[i, j] += du[k, i, j] * dv[k, i, j]


start = timeit.default_timer()

mesh = fe.MeshQuad1.init_tensor(*[np.linspace(0, 1, 1001)] * 2)
element = fe.ElementQuad1()
mapping = fe.MappingIsoparametric(mesh, element)
basis = fe.Basis(mesh, element, mapping, intorder=2)


@fe.BilinearForm(nthreads=4)
def laplace(u, v, w):
    # zeros_like (not empty_like) because nlaplace accumulates with +=
    out = np.zeros_like(u.grad[0])
    nlaplace(out, u.grad, v.grad)
    return out


lap = laplace.assemble(basis)
stop = timeit.default_timer()
print(stop - start)
```

and the following diffs:

It improved the timing from about 5.5 seconds to 3.6 seconds. Still some work to do, though. Thanks for the timings!
-
@halbux by "matrix A ready to be used" you mean in some sort of CSC/CSR data structure, right?
-
Exactly: in CSR format (turns out the conversion to CSR takes some time after all). It's not a standard BLAS function, so I had to write it myself. In MKL, though, there is a function for that, which I bet is highly optimized.
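For context, the conversion in question can be sketched with SciPy, which is the format scikit-fem returns its matrices in (sparselizard's CSR conversion is its own C++ implementation):

```python
import numpy as np
import scipy.sparse as sp

# (row, col, value) triplets as a typical FEM assembly loop emits them;
# duplicate (row, col) pairs are summed during the conversion
rows = np.array([0, 0, 1, 2, 2, 0])
cols = np.array([0, 2, 1, 0, 2, 0])
vals = np.array([4.0, -1.0, 3.0, -1.0, 2.0, 1.0])

A = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()
print(A.toarray())  # entry (0, 0) is 5.0: the two (0, 0) triplets were summed
```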
-
Yes! I found out about sparselizard some time ago already, but I'm surprised that you work in Finland as well. If you ever happen to visit Aalto University, you can find me in the main building, M-wing. I used to visit Tampere for project meetings and seminars some years ago, but right now there is nothing going on. Maybe we must arrange a Finnish Finite Element Fair or something. ;-)
-
I will gladly, but so far I have never been. :)
-
Oops, no, it doesn't!

```python
import numpy as np  # needed for np.printoptions below

import skfem as fe
from skfem.models.poisson import laplace

m = fe.MeshQuad()
elem = fe.ElementQuad1()
with np.printoptions(suppress=True, precision=3):
    for mapping in [fe.MappingAffine(m), fe.MappingIsoparametric(m, elem)]:
        print(laplace.assemble(fe.Basis(m, elem, mapping)).toarray())
```

gives two different matrices (output omitted). I had assumed that it would be O.K. on quadrilateral elements that are affinely similar to the reference square, i.e. parallelograms, and in particular rectangular constructions like docs/examples/ex19.py, lines 53 to 56 at commit 8248218. (And I am hoping that I have not assumed and used this in any work to date…!)
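For the record, the geometric condition behind that assumption (my derivation, standard FEM): with vertices $\mathbf{x}_1, \dots, \mathbf{x}_4$ ordered around the quad and $(\xi, \eta) \in [0, 1]^2$, the bilinear map is

$$F(\xi, \eta) = \mathbf{x}_1 + (\mathbf{x}_2 - \mathbf{x}_1)\,\xi + (\mathbf{x}_4 - \mathbf{x}_1)\,\eta + (\mathbf{x}_1 - \mathbf{x}_2 + \mathbf{x}_3 - \mathbf{x}_4)\,\xi\eta,$$

which is affine exactly when the cross-term coefficient vanishes, i.e. when the quad is a parallelogram; the experiment above suggests MappingAffine does not implement even that special case for quads.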