Proceedings 2024 ESPResSo meetings
Jean-Noël Grad edited this page Dec 17, 2024
Hardware at the ICP:
- 1152 cores and 36 GPUs on the Ant cluster
- 500 cores and 40 GPUs on the HTCondor infrastructure
Benchmarks:
- single-core performance: HTCondor cores deliver about 30% more performance than Ant cores
- multi-core performance: on HTCondor, performance stops improving after 4 cores for simulations with 10k particles, and after 8 cores for 100k particles, while for GPU simulations 1 core is usually sufficient
- use shared memory parallelism on the node level (rather than distributed memory) to reduce footprint of ghost particles calculations
- use struct-of-arrays for particle lists
- LJ simulation prototype implemented using Kokkos+Cabana
- feature name change: `CUDA` -> `ESPRESSO_CUDA`, etc. (#4974)
- reduce number of CMake configuration options, e.g. with a unified `-D ESPRESSO_BUILD_WITH_ARCHS="scalar,avx2,cuda"` option
- unify Python classes for CPU, AVX2 and GPU kernels with an extra argument `arch`:
  - `arch="cpu:auto"`: default value, maximal portability, selects the fastest kernel available for the current hardware
  - `arch="cpu:avx2"`: selects AVX2 kernels if supported by the hardware (Intel, AMD), otherwise raises an error
  - `arch="cpu:neon"`: selects Neon kernels if supported by the hardware (ARM), otherwise raises an error
  - `arch="cpu:scalar"`: selects scalar kernels (no vectorization, slowest)
  - `arch="gpu:auto"`: selects the GPU kernels against which ESPResSo was built if a matching GPU is available, otherwise raises an error
  - `arch="gpu:cuda"`: selects the CUDA GPU kernels if an Nvidia GPU is available, otherwise raises an error
  - `arch="gpu:rocm"`: selects the ROCm GPU kernels if an AMD GPU is available, otherwise raises an error
- for a column system, where each MPI rank contains a cubic slice of the column, optimal performance is achieved by orienting the column main axis along the z-direction
- for a cuboid system, slicing along the z-direction is also the optimal communication pattern, since slicing along 2 or 3 directions introduces communication with extra partners
- the default MPI Cartesian topology in ESPResSo is in descending order, for multi-GPU LB the user must manually set it to ascending order
- due to padding of GPU fields, the memory footprint is minimized when the size of the rank-local LB domain in agrid units along the x-direction is an integer multiple of 64 (single-precision) or 32 (double-precision)
- the formula is relatively simple to derive and involves a stepwise function
- every time we enter a new step, increasing the size along the x-direction is essentially free since the new data replaces the existing padding, until we reach the next step in the curve
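The stepwise footprint described above can be sketched as follows. The helper name, the D3Q19 population count, and the omission of ghost layers are assumptions for illustration; only the padding multiples (64 for single precision, 32 for double precision) come from the notes.

```python
import math

def padded_lb_field_size(nx, ny, nz, single_precision=True, q=19):
    """Estimate a rank-local LB field footprint in bytes, including padding.

    Illustrative sketch: the x-extent (in agrid units) is padded up to a
    multiple of 64 (single precision) or 32 (double precision), so the
    footprint is a stepwise function of nx. Assumes a D3Q19 population
    field and ignores ghost layers.
    """
    pad = 64 if single_precision else 32
    bytes_per_value = 4 if single_precision else 8
    padded_nx = math.ceil(nx / pad) * pad  # next step in the curve
    return padded_nx * ny * nz * q * bytes_per_value

# Growing nx from 65 to 128 is essentially free: both pad to 128,
# so the new data only replaces existing padding.
```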
- Alex: validated, one missing link to the user guide
- JN: validated, one missing link to the book chapter
- Julian: validated
- Sam: still a work in progress
- not present: Keerthi, David
- on CPU, the development branch of ESPResSo outperforms the 4.2.2 release (#4921)
- on GPU, GDRcopy is needed to remove a performance bottleneck in multi-GPU simulations
- progress was made during and after the coding day in improving the script interface, introducing the ZnDraw visualizer in tutorials, fixing a corner case of the Lees-Edwards collision operator in LB, and fixing regressions in the Python implementation of Monte Carlo
- there is a recurring issue with the difficulty level of C++ tasks
- the core team needs to improve onboarding of C++ developers
- MetaTensor integration in ESPResSo is challenging due to dependencies
- need to find test cases based on current ML research done with ESPResSo
- now encapsulated: non-bonded and bonded interactions, collision detection, particle list, cluster structure analysis, OIF, IBM, auto-update accumulators, constraints, MPI-IO, MMM1D (#4950)
- new API: several features now take an ESPResSo system as argument: Cluster Structure, MPI-IO
- in the future, more features will take a system or particle slice as argument, e.g. Observables (#4954)
- `system.thermostat` and `system.integrator` will be removed in favor of `system.propagation`
- possible API: JSON data structure
- easy to read from a parameter file
- conveys the hierarchical nature of mixed propagation modes
- avoids the ambiguity of similarly named parameters, e.g. "gamma" for both Langevin and Brownian, but "gamma0" and "gammaV" for NpT
  `system.propagation.set(kT=1., translation={"Langevin": {"gamma": 1.}, "LB": {"gamma": 2.}}, rotation={"Euler"})`
- more details in #4953 and in an upcoming announcement on the mailing list
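A JSON parameter file matching the proposed call could look like the sketch below. The schema is illustrative only; #4953 discusses the actual design.

```python
import json

# Hypothetical JSON parameter file mirroring the proposed propagation API.
# The hierarchy keeps each thermostat's "gamma" under its own key, avoiding
# the flat-namespace ambiguity of the current interface.
params_json = """
{
  "kT": 1.0,
  "translation": {"Langevin": {"gamma": 1.0}, "LB": {"gamma": 2.0}},
  "rotation": {"Euler": {}}
}
"""

params = json.loads(params_json)
# e.g. params["translation"]["Langevin"]["gamma"] is unambiguous even
# though the LB coupling also has a parameter named "gamma"
```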
- Tuesday, August 6, 2024
- see mailing list for more details
- version requirements of many dependencies were updated
- experimental support for multi-GPU LB is underway
- requires a suitable CUDA-aware MPI library
- for now, use one GPU device per MPI rank
- long-term plan: use one GPU device and multiple OpenMP threads per MPI rank
- planned removal/replacement of the GPU implementations of long-range solvers
- vector field visualization (LB velocities)
- bacterial growth simulation (non-constant number of particles)
- raytracing of porous media
- red blood cell transport in capillary blood vessels
- LB GPU now works in parallel
- CUDA-aware MPI is still a work-in-progress
- work on multi-GPU support has just started
- long-term plan: multi-GPU support with 1 GPU per MPI rank and multiple shared memory threads per MPI rank via OpenMP
- multi-system simulations are now possible for almost all features
- caveats: two systems cannot have particles with the same particle ids, Monte Carlo not yet supported
- can be enabled with a one-liner change to `system.py` (see last commit in jngrad/multiverse)
- convert particle cells from AoS to SoA (#4754), i.e. one array per particle property
- improves cache locality and CPU optimizations
- use Cabana to hide optimizations
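The AoS-to-SoA conversion can be illustrated with plain Python containers. This is not the actual ESPResSo data structure (see #4754 and Cabana for the real work), just a sketch of the layout difference.

```python
# Array-of-Structs: one record per particle, properties interleaved.
particles_aos = [
    {"id": 0, "pos": (0.0, 0.0, 0.0), "q": 1.0},
    {"id": 1, "pos": (1.0, 0.0, 0.0), "q": -1.0},
]

# Struct-of-Arrays: one contiguous array per property. A kernel that only
# reads charges streams through the "q" array without skipping over ids
# and positions, which improves cache locality and enables vectorization.
particles_soa = {
    "id":  [0, 1],
    "pos": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "q":   [1.0, -1.0],
}

total_charge = sum(particles_soa["q"])  # touches only the "q" array
```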
- bump all version requirements (#4905)
- on ICP workstations, only need to update formatting and linter tools with `pip3 install -r requirements.txt autopep8 pycodestyle pylint pre-commit`
- ModEMUS project: nanoparticle diffusion in hydrogel network, by Pablo Blanco (NTNU) (see Ma et al. 2018)
- currently implemented with Langevin, plan is to use LB instead to improve accuracy
- ultrasound streaming could be modeled with a gradient pressure via pressure boundary conditions
- GPU LB with particle coupling implemented in #4734
- requires CUDA >= 12.0 to make double-precision `atomicAdd()` available
- performance is degraded when more than 1 MPI rank communicates with the same GPU; need to look into CUDA-aware MPI
- pyMBE-dev/pyMBE now uses EESSI for CI/CD
- continuous integration PR: pyMBE-dev/pyMBE#1
- continuous delivery PR: pyMBE-dev/pyMBE#23
- deployed docs: https://pymbe-dev.github.io/pyMBE
- Thursday: HPC team meeting to discuss software stack
- Monday: set up the software stack
- Tuesday: online meeting with the company
- main objectives of the CoE MultiXscale:
- EESSI: "app store" for scientific software
- multiscale simulations with 3 pilot cases: helicopter blades turbulent flow, ultrasound imaging of living tissues, energy storage devices
- make software pre-exascale ready
- training on using this software
- ongoing projects for the ICP:
- ZnDraw+SiMGen project in collaboration with Gábor Csányi
- demo available at https://zndraw.icp.uni-stuttgart.de/
- LB boundaries in the waLBerla version of ESPResSo are broken when using 2 or more MPI ranks
- the LB ghost layer doesn't contain any information about boundaries, so fluid can flow out into the neighboring LB domain, where it gets trapped inside the boundary
- more details can be found in the bug report #4859
- the solution will require a ghost update after every call to a node/slice/shape boundary velocity setter function
- it is now possible to choose the exact equations of motion solved for each particle
- more combinations of integrators and thermostats are allowed
- one can mix different types of virtual sites in the same simulation
- a new Python interface needs to be designed (target for next meeting)
- CUDA 12.0 is now the default everywhere at the ICP
- GTX 980 GPUs are being removed from ICP workstations (GTX 1000 series and higher are needed for double-precision `atomicAdd`)
- Python 3.12 changes the way we use the unittest module (#4852)
- Python virtual environments become mandatory for `pip install` in Ubuntu 24.04
  - user guide needs to reflect that change
- ZnDraw bridge is currently being developed for ESPResSo
- supports live visualization in the web browser
- the plan is to re-implement all features available in the ESPResSo OpenGL visualizer
- CUDA 12 will be the default in Ubuntu 24.04, due April 2024
- CUDA 12 is required for C++20, which makes contributing to ESPResSo significantly easier (#4846)
- compute clusters and supercomputers where ESPResSo is currently used already provide compiler toolchains compatible with CUDA 12