Speed-up hot path of GFN2 geometry optimization #1178

Open. foxtran wants to merge 6 commits into main from feature/speedup-gfn2.

foxtran
Contributor

@foxtran foxtran commented Feb 5, 2025

This PR improves parallelization for GFN2 geometry optimization. It affects other types of calculations too.

The achieved speed-up is about 30% in total wall time. Gradients are about 2x faster!

For a 578-atom structure, before the patch:

 total:
 * wall-time:     0 d,  0 h,  1 min, 11.680 sec
 *  cpu-time:     0 d,  0 h,  3 min,  7.524 sec
 * ratio c/w:     2.616 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min, 13.534 sec
 *  cpu-time:     0 d,  0 h,  0 min, 41.212 sec
 * ratio c/w:     3.045 speedup
 ANC optimizer:
 * wall-time:     0 d,  0 h,  0 min, 57.390 sec
 *  cpu-time:     0 d,  0 h,  2 min, 25.103 sec
 * ratio c/w:     2.528 speedup

normal termination of xtb

real 1m11.740s
user 3m5.985s
sys 0m1.572s

After the patch:

 total:
 * wall-time:     0 d,  0 h,  0 min, 48.779 sec
 *  cpu-time:     0 d,  0 h,  2 min, 51.122 sec
 * ratio c/w:     3.508 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min, 11.737 sec
 *  cpu-time:     0 d,  0 h,  0 min, 42.249 sec
 * ratio c/w:     3.600 speedup
 ANC optimizer:
 * wall-time:     0 d,  0 h,  0 min, 36.377 sec
 *  cpu-time:     0 d,  0 h,  2 min,  8.025 sec
 * ratio c/w:     3.519 speedup

normal termination of xtb

real 0m48.971s
user 2m50.138s
sys 0m1.012s
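
As a quick cross-check of the two timing blocks above, the wall times can be turned into a speed-up factor and a relative reduction. This is only a minimal sketch: the helper names `to_seconds` and `wall_time_gain` are made up for illustration and are not part of xtb, and the input strings are copied from the output shown above.

```python
import re

def to_seconds(text):
    """Convert an xtb-style '0 d,  0 h,  1 min, 11.680 sec' string to seconds."""
    m = re.search(r"(\d+)\s*d,\s*(\d+)\s*h,\s*(\d+)\s*min,\s*([\d.]+)\s*sec", text)
    if m is None:
        raise ValueError(f"unrecognized time string: {text!r}")
    d, h, mi, s = m.groups()
    return int(d) * 86400 + int(h) * 3600 + int(mi) * 60 + float(s)

def wall_time_gain(before_s, after_s):
    """Return (speed-up factor, fraction of wall time saved)."""
    return before_s / after_s, 1.0 - after_s / before_s

# Wall times copied from the "before" and "after" outputs above.
before = {
    "total": "0 d,  0 h,  1 min, 11.680 sec",
    "SCF": "0 d,  0 h,  0 min, 13.534 sec",
    "ANC optimizer": "0 d,  0 h,  0 min, 57.390 sec",
}
after = {
    "total": "0 d,  0 h,  0 min, 48.779 sec",
    "SCF": "0 d,  0 h,  0 min, 11.737 sec",
    "ANC optimizer": "0 d,  0 h,  0 min, 36.377 sec",
}

for name in before:
    speedup, saved = wall_time_gain(to_seconds(before[name]), to_seconds(after[name]))
    print(f"{name:14s} {speedup:.2f}x faster, {100 * saved:.0f}% less wall time")
```

For these numbers the total comes out at roughly 1.47x (about 32% less wall time), and the ANC optimizer part at roughly 1.58x.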

Command for testing:

xtb --input model.xtb model.xyz -P 4 --gfn 2 --opt vtight --alpb water --cycles 5

model.xtb (nothing interesting, to be honest):

$opt
    engine=rf
$end
$alpb
    kernel=still
    grid=tight
$end

Build type: CMake, Release, gfortran-14, Intel MKL, no -march flag (setting it could give an extra speed-up), debug info enabled.

@foxtran
Contributor Author

foxtran commented Feb 5, 2025

@Albkat, @awvwgk, it is not funny that about 1/20 of all CI runs are failing :(

@Albkat Albkat self-requested a review February 5, 2025 19:59
@Albkat Albkat added the "driver: optimization" label (Related to the geometry optimization driver) Feb 5, 2025
@Albkat Albkat added this to the v6.7.2 milestone Feb 5, 2025
@foxtran
Contributor Author

foxtran commented Feb 5, 2025

I have no idea why it does not work with ifort. I will try to debug it tomorrow.

@foxtran
Contributor Author

foxtran commented Feb 5, 2025

@Albkat, do not try to restart it: it will still fail. Most probably it is a bug in the xtb code with undefined variables. I have a tool for debugging compilers :-)

@Albkat
Member

Albkat commented Feb 5, 2025

Thanks, I'll test this.

> @Albkat, @awvwgk, it is not funny that about 1/20 of all CI runs are failing :(

Regarding CI, there are some stochastic failures with PBC, GFN-FF, and CPCM-X that we are trying to identify. It is still a work in progress.

> @Albkat, do not try to restart it: it will still fail. Most probably it is a bug in the xtb code with undefined variables. I have a tool for debugging compilers :-)

Yeah, just double-checking if it is compiler-specific :)

@foxtran
Contributor Author

foxtran commented Feb 5, 2025

> Regarding CI, there are some stochastic failures with PBC, GFN-FF, and CPCM-X that we are trying to identify. It is still a work in progress.

They are reproducible. I reproduced a couple of them. :-)

@Albkat
Member

Albkat commented Feb 5, 2025

> > Regarding CI, there are some stochastic failures with PBC, GFN-FF, and CPCM-X that we are trying to identify. It is still a work in progress.
>
> They are reproducible. I reproduced a couple of them. :-)

I can reproduce them about 30% of the time when running sequentially.
Did you use anything special to freeze the executable state?

@foxtran
Contributor Author

foxtran commented Feb 5, 2025

@Albkat, see #1182. I just did it for you right now :)

@foxtran foxtran force-pushed the feature/speedup-gfn2 branch from 0401508 to d9a5bd9 Compare February 6, 2025 09:18
@foxtran
Contributor Author

foxtran commented Feb 6, 2025

@Albkat, it now works with the legacy Intel compilers :-)

@foxtran foxtran force-pushed the feature/speedup-gfn2 branch from d9a5bd9 to 1fbb43d Compare February 10, 2025 21:25
@foxtran
Contributor Author

foxtran commented Feb 10, 2025

@Albkat, can we merge it? :-)

@Albkat
Member

Albkat commented Feb 10, 2025

Seems legit to me, I'll test this tomorrow and then merge:)
It would be beneficial to understand what you did there, maybe we could use this in other places too...

@foxtran
Contributor Author

foxtran commented Feb 10, 2025

> It would be beneficial to understand what you did there, maybe we could use this in other places too...

I just started to use the collapse(2) OpenMP clause :-)

Note that most Fortran compilers do not support triangular loop nests, so one needs to give them rectangular iteration spaces, and one needs to be careful about which elements are skipped for better performance and which schedule should be used.
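
To make the point about skipped elements and scheduling concrete, here is a toy load-balance model. It is not the code from this PR and it does not run OpenMP: it only counts how many (i, j) pairs with j <= i each thread would get, assuming that a plain static schedule hands out contiguous, near-equal blocks and that a chunked static schedule deals fixed-size chunks round-robin. All function names are hypothetical.

```python
def static_chunks(n_items, n_threads):
    """Contiguous, near-equal blocks, one per thread (model of a plain static schedule)."""
    base, extra = divmod(n_items, n_threads)
    bounds, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def outer_only_work(n, n_threads):
    """Triangular nest, only the outer loop is split: row i costs i + 1 pairs."""
    return [sum(i + 1 for i in range(lo, hi)) for lo, hi in static_chunks(n, n_threads)]

def collapsed_block_work(n, n_threads):
    """Rectangular n x n space, flattened as k = i * n + j, split into contiguous blocks.
    Elements with j > i are skipped and cost nothing in this model."""
    work = []
    for lo, hi in static_chunks(n * n, n_threads):
        work.append(sum(1 for k in range(lo, hi) if k % n <= k // n))
    return work

def collapsed_cyclic_work(n, n_threads, chunk):
    """Same flattened space, but fixed-size chunks are dealt out round-robin."""
    work = [0] * n_threads
    for k in range(n * n):
        i, j = divmod(k, n)
        if j <= i:
            work[(k // chunk) % n_threads] += 1
    return work

def imbalance(work):
    """Busiest thread relative to the average; 1.0 means perfectly balanced."""
    return max(work) / (sum(work) / len(work))

if __name__ == "__main__":
    n, threads = 400, 8
    print("outer loop only, block static :", round(imbalance(outer_only_work(n, threads)), 3))
    print("collapsed, block static       :", round(imbalance(collapsed_block_work(n, threads)), 3))
    print("collapsed, chunks of 64       :", round(imbalance(collapsed_cyclic_work(n, threads, 64)), 3))
```

In this model, collapsing alone does not change the balance when the number of rows is a multiple of the thread count, because the contiguous blocks then coincide with blocks of whole rows; the gain comes from combining the larger flattened space with a finer-grained schedule. Real timings also depend on the cost per pair and on memory access patterns, which this count ignores.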

@foxtran foxtran force-pushed the feature/speedup-gfn2 branch from 1fbb43d to 9c78931 Compare February 11, 2025 09:22
@foxtran
Contributor Author

foxtran commented Feb 11, 2025

@Albkat, rebased onto the current main branch

@marvinfriede
Member

> I just started to use the collapse(2) OpenMP clause :-)
>
> Note that most Fortran compilers do not support triangular loop nests, so one needs to give them rectangular iteration spaces, and one needs to be careful about which elements are skipped for better performance and which schedule should be used.

It seems the rules on collapse are rather strict (see here).

I did some testing with the D4 (standalone) code and did not find any improvements: triangular iteration space with only outer loop parallel and rectangular iteration space with collapse(2) yielded the same timings. Just wanted to say that this requires testing and does not always help (but nice find here in xtb) :)

@foxtran
Contributor Author

foxtran commented Feb 11, 2025

> triangular iteration space with only outer loop parallel and rectangular iteration space with collapse(2) yielded the same timings.

How many atoms does the test have?

@marvinfriede
Member

> > triangular iteration space with only outer loop parallel and rectangular iteration space with collapse(2) yielded the same timings.
>
> How many atoms does the test have?

I think I tested 500 and 1000 atoms, maybe also 2000.

@foxtran
Contributor Author

foxtran commented Feb 11, 2025

Hmm... That is interesting. I'm using gfortran for testing.

@marvinfriede
Member

> Hmm... That is interesting. I'm using gfortran for testing.

Same. I will take a look at my test setup again later. Maybe I missed something.

@foxtran
Contributor Author

foxtran commented Feb 11, 2025

BTW, do you check CPU time or real time? CPU time should not change significantly.

@marvinfriede
Member

> BTW, do you check CPU time or real time? CPU time should not change significantly.

I checked real time.

@foxtran
Contributor Author

foxtran commented Feb 11, 2025

Ah... I found why it does not show up for your systems: the system I am using for testing has an extremely fast SCF and a slow gradient part, while another system, which I generated, spends 2 min in the SCF and only 10 seconds in the gradients.
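
The effect is essentially Amdahl's law: only the gradient part gets faster, so the overall gain depends on how much of the run is spent there. A minimal sketch, assuming the gradient is about 2x faster and everything else is unchanged; `overall_speedup` is a hypothetical helper, and the 1 s / 4 s split is an illustrative gradient-dominated case rather than a measurement.

```python
def overall_speedup(fraction_accelerated, local_speedup):
    """Amdahl's law: total speed-up when only a fraction of the run time is accelerated."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / local_speedup)

GRADIENT_SPEEDUP = 2.0  # assumption: gradients roughly 2x faster, as stated in this PR

# SCF-dominated case from the comment above: 2 min SCF, 10 s gradients.
scf_heavy = overall_speedup(10.0 / (120.0 + 10.0), GRADIENT_SPEEDUP)

# Illustrative gradient-dominated case (made-up split: 1 s SCF, 4 s gradients).
grad_heavy = overall_speedup(4.0 / (1.0 + 4.0), GRADIENT_SPEEDUP)

print(f"SCF-dominated:      {scf_heavy:.2f}x overall")
print(f"gradient-dominated: {grad_heavy:.2f}x overall")
```

With a 2 min SCF and 10 s of gradients, halving the gradient time changes the total by only a few percent, which would be hard to see in wall-clock measurements.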

@foxtran
Contributor Author

foxtran commented Feb 11, 2025

My system with 528 atoms, build from c8c187c:

Cycle 1:
     SCC iter.                  ...        0 min,  1.490 sec
     gradient                   ...        0 min,  3.551 sec
Cycle 2:
     SCC iter.                  ...        0 min,  3.293 sec
     gradient                   ...        0 min,  4.337 sec
     
 total:
 * wall-time:     0 d,  0 h,  0 min, 55.900 sec
 *  cpu-time:     0 d,  0 h,  4 min, 36.190 sec
 * ratio c/w:     4.941 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min, 12.917 sec
 *  cpu-time:     0 d,  0 h,  1 min,  9.732 sec
 * ratio c/w:     5.399 speedup
 ANC optimizer:
 * wall-time:     0 d,  0 h,  0 min, 42.595 sec
 *  cpu-time:     0 d,  0 h,  3 min, 25.240 sec
 * ratio c/w:     4.818 speedup

Build from 9c78931:

Cycle 1:
     SCC iter.                  ...        0 min,  1.673 sec
     gradient                   ...        0 min,  1.913 sec
Cycle 2:
     SCC iter.                  ...        0 min,  3.203 sec
     gradient                   ...        0 min,  1.524 sec

 total:
 * wall-time:     0 d,  0 h,  0 min, 40.080 sec
 *  cpu-time:     0 d,  0 h,  4 min, 23.575 sec
 * ratio c/w:     6.576 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min,  9.735 sec
 *  cpu-time:     0 d,  0 h,  1 min,  5.865 sec
 * ratio c/w:     6.766 speedup
 ANC optimizer:
 * wall-time:     0 d,  0 h,  0 min, 29.952 sec
 *  cpu-time:     0 d,  0 h,  3 min, 16.568 sec
 * ratio c/w:     6.563 speedup

@foxtran
Contributor Author

foxtran commented Feb 11, 2025

I have also tried to make a script for reproducing my numbers :-)

git clone https://github.com/grimme-lab/xtb.git xtb.opt
cd xtb.opt
cmake -Bbuild_main -DCMAKE_BUILD_TYPE=Release
# -- The C compiler identification is GNU 14.1.0
# -- The Fortran compiler identification is GNU 14.1.0
# -- Cray Programming Environment 2.7.23 C
# -- Found OpenMP_Fortran: -fopenmp (found version "4.5")
# -- Found BLAS: implicitly linked
make -C build_main -j 40
git remote add foxtran https://github.com/foxtran/xtb.git
git fetch foxtran
git checkout foxtran/feature/speedup-gfn2
cmake -Bbuild_opt -D CMAKE_BUILD_TYPE=Release
make -C build_opt -j 40
mkdir -p build_main/check/
mkdir -p build_opt/check/
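cat << 'EOF' > gen.py
#!/usr/bin/env python

# Build an N x N square grid of fluorine atoms (1.0 Å spacing) in the z = 0 plane.
N = 20
Nat = N * N

# XYZ format: atom count, comment line, then one "element x y z" line per atom.
out = [f"{Nat}", "Fluorine plane"]
for i in range(N):
  for j in range(N):
    out.append(f"F {i}.0 {j}.0 0.0")

# Write the same geometry into both build trees.
for path in ("build_main/check/F-plane.xyz", "build_opt/check/F-plane.xyz"):
  with open(path, "w") as f:
    f.write("\n".join(out) + "\n")
EOF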
chmod +x gen.py
./gen.py
cd build_main/check/
OMP_NUM_THREADS=8 ../xtb F-plane.xyz -P 8 --gfn 2 --opt --alpb water --cycles 5
#  SCC (total)                   0 d,  0 h,  0 min, 21.385 sec
# .............................. CYCLE    1 ..............................
#      SCC iter.                  ...        0 min,  2.515 sec
#      gradient                   ...        0 min,  1.880 sec
# .............................. CYCLE    2 ..............................
#      SCC iter.                  ...        0 min, 17.930 sec
#      gradient                   ...        0 min,  2.432 sec
# .............................. CYCLE    3 ..............................
#      SCC iter.                  ...        0 min, 14.734 sec
#      gradient                   ...        0 min,  2.539 sec
# 
#  total:
#  * wall-time:     0 d,  0 h,  1 min, 44.294 sec
#  *  cpu-time:     0 d,  0 h, 10 min, 30.204 sec
#  * ratio c/w:     6.043 speedup
#  SCF:
#  * wall-time:     0 d,  0 h,  0 min, 21.394 sec
#  *  cpu-time:     0 d,  0 h,  2 min, 17.542 sec
#  * ratio c/w:     6.429 speedup
#  ANC optimizer:
#  * wall-time:     0 d,  0 h,  1 min, 22.522 sec
#  *  cpu-time:     0 d,  0 h,  8 min, 11.316 sec
#  * ratio c/w:     5.954 speedup
cd ../../
cd build_opt/check/
OMP_NUM_THREADS=8 ../xtb F-plane.xyz -P 8 --gfn 2 --opt --alpb water --cycles 5
#  SCC (total)                   0 d,  0 h,  0 min, 17.003 sec
# .............................. CYCLE    1 ..............................
#      SCC iter.                  ...        0 min,  2.285 sec
#      gradient                   ...        0 min,  1.033 sec
# .............................. CYCLE    2 ..............................
#      SCC iter.                  ...        0 min, 14.642 sec
#      gradient                   ...        0 min,  1.102 sec
# .............................. CYCLE    3 ..............................
#      SCC iter.                  ...        0 min, 11.684 sec
#      gradient                   ...        0 min,  1.213 sec
# 
#  total:
#  * wall-time:     0 d,  0 h,  1 min, 21.159 sec
#  *  cpu-time:     0 d,  0 h,  8 min, 56.798 sec
#  * ratio c/w:     6.614 speedup
#  SCF:
#  * wall-time:     0 d,  0 h,  0 min, 17.014 sec
#  *  cpu-time:     0 d,  0 h,  1 min, 58.051 sec
#  * ratio c/w:     6.939 speedup
#  ANC optimizer:
#  * wall-time:     0 d,  0 h,  1 min,  3.859 sec
#  *  cpu-time:     0 d,  0 h,  6 min, 57.683 sec
#  * ratio c/w:     6.541 speedup

As you can see, the parallelization became a little better. However, if the SCF is the bottleneck, this patch does not provide significant improvements :(

Labels
driver: optimization Related to the geometry optimization driver
4 participants