Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized algorithms not implemented for 1D and 2D geometry (even with the non-adaptive method) #428

Open
michaeltouati opened this issue Jul 22, 2021 · 17 comments
Labels
feature-request something that could be added to the code

Comments

@michaeltouati
Copy link

michaeltouati commented Jul 22, 2021

  Dear developpers,

Do you plan to add vectorization of SMILEI for 1D and 2D Cartesian geometry soon?
Since it is already implemented for simulation using a 3D Cartesian geometry, I think it wouldn't take a lot of time to extend it to the simpler 1D and 2D Cartesian geometry cases.
Waiting for your answer,
Best wishes,
Michaël

@mccoys mccoys added the feature-request something that could be added to the code label Jul 22, 2021
@mccoys
Copy link
Contributor

mccoys commented Jul 22, 2021

In fact, vectorization is available in 2D, but not in the adaptive mode. This would require some more investigations.
Summary:

  • In 1D: there is no vectorization available yet.
  • In 2D: vectorization is available but not in adaptive mode.
  • In 3D: vectorization is available.
  • In AM: there is no vectorization available yet.

@michaeltouati
Copy link
Author

I'm unlucky.
I was interested in 1D Cartesian expensive simulations on a desktop using the threadripper AMD technology (64 cores).
I asked the question to know if the developers who implemented the 2D and 3D Cartesian case has planned to do it in 1D fastly (week time scale) or not. Depending on if yes or not, I'm eventually willing to do it (months time scale).

@beck-llr
Copy link
Contributor

There is a reason why it has not been done yet. We estimate that the performance improvement would probably not be significant. We could be wrong of course.

We are currently working on vectorization for AM geometry but there is no schedule for vectorization in 1D for the moment.

@michaeltouati
Copy link
Author

michaeltouati commented Jul 22, 2021

  Re-bonjour,

"There is a reason why it has not been done yet. We estimate that the performance improvement would probably not be significant. We could be wrong of course."

The way I see it, it should accelerate every loops even in 1D Cartesian geometry!

"We are currently working on vectorization for AM geometry but there is no schedule for vectorization in 1D for the moment."

Ok. I'll have a look about what has been done in 2D Cartesian case and I'll try to extend it to the 1D one.
Can you tell me the source files that should be modified accordingly? I would earn a lot of time.

En te remerciant,
Best regards,
Michaël

@MickaelGrech
Copy link
Contributor

MickaelGrech commented Jul 22, 2021

Ciao Michaël,
Just to briefly come back to what @beck-llr wrote, it is not so much that one would expect a small relative gain in 1D, the relative gain could indeed be large in 1D, in particular as one can usually consider very large number of particles per cell.
It's more that the absolute gain in CPU time would not be so large at the end for most cases. Indeed, most 1D simulations can usually run really fast, if not on your workstation than just getting to a medium size cluster/supercomputer.
When you have finite manpower, you need to make choices :D here, we chose to spent more CPU time & less human/dev time.
This being said, are you sure it's a smart move to get involved in this vectorization work?

@michaeltouati
Copy link
Author

Coucou Micka,
J'espère que tu vas bien.

Yes, it would be wonderful if it is implemented in 1D.

I'm running expensive 1D-3V PIC simulations of TNSA using SMILEI resolving the few nm of rear side layer of Aluminum oxide contaminated by water vapor and hydrocarbures diffusion inside it.

I've already run the simulations with laser pulses from 30 fs FWHM to few hundreds of fs FWHM interacting with a 10 microns thick Aluminum target on tesseract of DIRAC supercomputing center in UK.

I wanted to run the less expensive simulations of 3 microns thick targets with the same pulse durations as mentioned before on my nice computer that has the AMD threadripper CPU (64 cores).

The vectorization would make it more or less ~4 times faster the simulations in both cases according to my experience using OSIRIS using AVX or AVX2 and GNU/OpenMPI compilers concerning other kind of simulations (mainly relativistic collisionless shocks).

I fully understand the problem of human resources. That is why I proposed myself to implement it if it was not plan on the short time scale. I would however earn time if you tell me the source files that should be modified (1D Yee scheme, 1D Esirkepov, 1D current deposit and 1D interpolations of fields and related input deck parameters management). I can basically copy / past what has already been done for the 2D Cartesian case. For sure, it won't be with the "adaptative mode" in a first time such as the already implemented 2D Cartesian case.

Wishing you a good afternoon,
Best wishes,
Michaël

@xxirii
Copy link
Contributor

xxirii commented Jul 23, 2021

Hi Michael,

I can assist you in this work. First I would like to mention few things:

  • 1D vectorization has never been tried because we assumed the gain would be low and 1D simulations are never expensive. Obviously the last assumption is wrong if you have a tremendous number of particles, long simulations or/and do parametric studies (ensemble runs).
  • Current vectorization has been tuned for Intel compilers and it is very efficient compiled with the Intel compiler. Recent work on ARM and AMD processor shows that the code needs some small modification to run with full vectoization efficiency hen compiled with LLVM or GNU. On AMD processor, Smilei performance (compiled with Intel) is really good but we can see that the difference between scalar and vector modes is now as significant as on Intel processors. This is partly the consequence of the smaller vector register size (AVX2).
  • You will surely not be able to gain a speedup of 4. First because some parts of the code can not be vectorized and then because the interpolator and projector operators have specific memory access patterns that slow down vecto efficiency.
  • Note that vectorization induces sorting of the particle. This sorting induces an overhead that has to be compensated by vectorization.

Where you need to work:

The main operators that you have to vectorize are the interpolator and the projector. The pusher is already vectorized for all geometries. Maxwell solver is partially vectorized and quick enough.

You will have to work in the directories src/Interpolator (https://github.com/SmileiPIC/Smilei/tree/master/src/Interpolator) and src/Projector (https://github.com/SmileiPIC/Smilei/tree/master/src/Projector).

If we look at the Interpolator directory, you will see that there is a class per geometry (1D, 2D 3D...), order (2 and 4) and level of optimization (scalar or vecto). For instance Interpolator3D4Order.cpp is the class for the 3D order 4 interpolation and Interpolator3D4OrderV.cpp the same but vectorized.

You will therefore have to create a class called Interpolator1D2OrderV.cpp (at order 2). You can then use the vectorized algorithms in 2D and 3D to create a 1D version. You can also use our paper on vectorization to assist you.

Then to be able to use this class, you have to modify the factory: InterpolatorFactory.h. The factory determines according to the user input the correct functor to call. For instance, if you look at 3D:

        else if( ( params.geometry == "3Dcartesian" ) && ( params.interpolation_order == 2 ) ) {
            if( !vectorization ) {
                Interp = new Interpolator3D2Order( params, patch );
            }
#ifdef _VECTO
            else {
                Interp = new Interpolator3D2OrderV( params, patch );
            }
#endif

you can see that there is a if-condition specific to vectorization, you have to do the same for 1D.

The projector works the same way.

@michaeltouati
Copy link
Author

Hello Xxirii,

Thank you very much for all these important pieces of information.
It is very clear and, without it, it would have been impossible for me to implement it.

Do you confirm https://doi.org/10.1016/j.cpc.2019.05.001 is the paper that you mentioned?

@xxirii
Copy link
Contributor

xxirii commented Jul 23, 2021

Hi,

Do not hesitate to ask any question. Here or in Element. I confirm this is the paper I was mentioning.

@michaeltouati
Copy link
Author

Merci beaucoup

@xxirii
Copy link
Contributor

xxirii commented Jul 30, 2021

@michaeltouati

For your information, I will test in the next few days the possibility of 1D vectorized operators. I will let you know about the efficiency.

@xxirii
Copy link
Contributor

xxirii commented Aug 2, 2021

So, I have tried a vectorized version of the interpolator. You can switch the content of the function Interpolator1D2Order::fieldsWrapper by the following:

void Interpolator1D2Order::fieldsWrapper( ElectroMagn *EMfields, Particles &particles,
                                          SmileiMPI *smpi, int *istart, int *iend, int ithread, int ipart_ref )
{
    std::vector<double> *Epart = &( smpi->dynamics_Epart[ithread] );
    std::vector<double> *Bpart = &( smpi->dynamics_Bpart[ithread] );
    int *iold = &( smpi->dynamics_iold[ithread][0] );
    double *delta = &( smpi->dynamics_deltaold[ithread][0] );

    int nparts = particles.size();

    double * __restrict__ position_x = particles.getPtrPosition(0);

    double * __restrict__ Epart_x= &( smpi->dynamics_Epart[ithread][0*nparts] );
    double * __restrict__ Epart_y= &( smpi->dynamics_Epart[ithread][1*nparts] );
    double * __restrict__ Epart_z= &( smpi->dynamics_Epart[ithread][2*nparts] );

    double * __restrict__ Bpart_x= &( smpi->dynamics_Bpart[ithread][0*nparts] );
    double * __restrict__ Bpart_y= &( smpi->dynamics_Bpart[ithread][1*nparts] );
    double * __restrict__ Bpart_z= &( smpi->dynamics_Bpart[ithread][2*nparts] );

    double * __restrict__ Ex = EMfields->Ex_->data();
    double * __restrict__ Ey = EMfields->Ey_->data();
    double * __restrict__ Ez = EMfields->Ez_->data();
    double * __restrict__ Bx = EMfields->Bx_m->data();
    double * __restrict__ By = EMfields->By_m->data();
    double * __restrict__ Bz = EMfields->Bz_m->data();

    int idx;  //dual index computed
    int ipx;  //prim index computed
    double xjn;
    double xjmxi2;

    double coeffd[3];
    double coeffp[3];

    #pragma omp simd private(xjn, xjmxi2, coeffd, coeffp, idx, ipx)
    for( int ipart=*istart ; ipart<*iend; ipart++ ) {
        //Interpolation on current particle

        // Particle position (in units of the spatial-step)
        xjn = position_x[ipart]*dx_inv_;

        // Dual
        idx      = round( xjn+0.5 );      // index of the central point
        xjmxi  = xjn - ( double )idx +0.5; // normalized distance to the central node
        xjmxi2 = xjmxi*xjmxi;            // square of the normalized distance to the central node

        // 2nd order interpolation on 3 nodes
        coeffd[0] = 0.5 * ( xjmxi2-xjmxi+0.25 );
        coeffd[1] = ( 0.75-xjmxi2 );
        coeffd[2] = 0.5 * ( xjmxi2+xjmxi+0.25 );

        idx -= index_domain_begin;

        // Primal
        ipx      = round( xjn );    // index of the central point
        xjmxi  = xjn -( double )ipx; // normalized distance to the central node
        xjmxi2 = xjmxi*xjmxi;   // square of the normalized distance to the central node

        // 2nd order interpolation on 3 nodes
        coeffp[0] = 0.5 * ( xjmxi2-xjmxi+0.25 );
        coeffp[1] = ( 0.75-xjmxi2 );
        coeffp[2] = 0.5 * ( xjmxi2+xjmxi+0.25 );

        ipx -= index_domain_begin;

        // // Interpolate the fields from the Dual grid : Ex, By, Bz
        Epart_x[ipart] = coeffd[0] * Ex[idx-1]   + coeffd[1] * Ex[idx]   + coeffd[2] * Ex[idx+1];
        Bpart_y[ipart] = coeffd[0] * By[idx-1]   + coeffd[1] * By[idx]   + coeffd[2] * By[idx+1];
        Bpart_z[ipart] = coeffd[0] * Bz[idx-1]   + coeffd[1] * Bz[idx]   + coeffd[2] * Bz[idx+1];

        // Interpolate the fields from the Primal grid : Ey, Ez, Bx
        Epart_y[ipart] = coeffp[0] * Ey[ipx-1]   + coeffp[1] * Ey[ipx]   + coeffp[2] * Ey[ipx+1];
        Epart_z[ipart] = coeffp[0] * Ez[ipx-1]   + coeffp[1] * Ez[ipx]   + coeffp[2] * Ez[ipx+1];
        Bpart_x[ipart] = coeffp[0] * Bx[ipx-1]   + coeffp[1] * Bx[ipx]   + coeffp[2] * Bx[ipx+1];

        //Buffering of iol and delta
        iold[ipart] = ipx;
        delta[ipart] = xjmxi;
    }

}

Note that this version is supposed to update the scalar version and you don't have to activate the vectorization to use it.

I did some tests using the Intel compiler. You have a substantial gain between 1 and 128 particles per cell. The speed-up is then reduced for larger numbers. This is not the typical behavior but here we do not use the cell sorting as it is usually the case in vecto. I suspect the particles to generate more cache misses when loading the fields from the grids a they get mixed. The performance results may depend on the patch size and the case. Smaller patch will increase the patch coherency but increase the time spent in particle synchronization.

interpolator_time_per_particle_per_socket

The code is also written to provide good performance with GNU and LLVM compilers. But I did not test them yet.

@mccoys
Copy link
Contributor

mccoys commented Aug 2, 2021

Did we ever try such large numbers of particles per cell in 2D or 3D?

@michaeltouati
Copy link
Author

michaeltouati commented Aug 2, 2021

Thank you very much @xxirii. I will test it this week and/or next week using GNU and OpenMPI compilers together with AVX and AVX2 vectorization compiling options on my threadripper desktop.

I agree with you, it should depend on the case, the number of patches and the decomposition domain strategy, especially if no particle sorting is performed. What was the test case that you were simulating to perform this computational time scaling per particle per iteration per socket. If you were simulating a periodic collisionless plasma at Maxwell-Boltzmann equilibrium, it should have avoid this re-increase of computational time.

I agree with you @mccoys : in 2D, when resolving the Debye length to simulate a collisionless plasma, a number of macroparticles per cell higher than 128 is quite scarce so in 3D...

However, when simulating collisional plasmas using the multiple binary Coulomb collision Monte Carlo module, it is quite common to need more than 128 macroparticles per cell.

For my part, I recently had to consider more than 128 macroparticles per cell even for collisionless plasmas when simulating relativistic collisionless shocks on very large time scales in order to limit the numerical heating.

Thanking you a lot again,
Best wishes,
Michaël

@xxirii
Copy link
Contributor

xxirii commented Aug 3, 2021

@mccoys : Yes, I did. In this case the time per particle in vectorized mode stabilizes or slightly increases but the difference with the scalar mode stays high.

@michaeltouati I performed this test using a homogeneous well-balanced Maxwellian plasma. I always use such a case for preliminary performance tests.

@mccoys
Copy link
Contributor

mccoys commented Dec 1, 2021

@xxirii Is this issue partly fixed now?

@xxirii
Copy link
Contributor

xxirii commented Dec 1, 2021

It is now really an issue but a need for features. It is partially addressed with extension of the vectorization in 2D and vectorization of the interpolator in 1D. I think we can close it and keep it in our roadmap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request something that could be added to the code
Projects
None yet
Development

No branches or pull requests

5 participants