Speed up k_find_block_bounds #1105

Merged
merged 5 commits into from
Aug 3, 2023
badisa
Collaborator

@badisa badisa commented Jul 29, 2023

  • Speeds up k_find_block_bounds by ~3x (24k ns to 7.5k ns) by using shuffle syncs
  • Exposed a gap in the tests: another implementation passed every test but caused a 30% slowdown, because in certain cases it produced much larger box bounds. Still investigating
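
For readers unfamiliar with shuffle-based reductions, the pattern can be emulated on the CPU. The sketch below is plain Python, not the kernel from this PR: `shfl_xor` is a hypothetical stand-in for CUDA's `__shfl_xor_sync`, the 32-lane warp is modeled as a list, and the input values are made up. Which shuffle variant the PR actually uses is not shown on this page. After log2(32) = 5 exchange rounds, every lane holds the warp-wide minimum and maximum without a round trip through shared memory.

```python
WARP_SIZE = 32

def shfl_xor(values, lane_mask):
    """Emulate a warp shuffle: lane i reads the value held by lane i ^ lane_mask."""
    return [values[i ^ lane_mask] for i in range(len(values))]

def warp_min_max(values):
    """Butterfly reduction: every lane ends up with the warp-wide (min, max)."""
    assert len(values) == WARP_SIZE
    mins = list(values)
    maxs = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane exchanges its partial result with its partner lane.
        other_mins = shfl_xor(mins, offset)
        other_maxs = shfl_xor(maxs, offset)
        mins = [min(a, b) for a, b in zip(mins, other_mins)]
        maxs = [max(a, b) for a, b in zip(maxs, other_maxs)]
        offset //= 2
    return mins, maxs

# One coordinate axis for the 32 atoms of a block (illustrative values only).
xs = [((7 * i) % 32) * 0.1 for i in range(WARP_SIZE)]
mins, maxs = warp_min_max(xs)
```

In a block-bounds setting, the same reduction would run once per axis, and the center and extent follow from the resulting min and max.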

Why bad implementations of k_find_block_bounds could pass the tests while generating bad boxes

The issue was that the Python test imaged the coordinates before passing them into the Neighborlist (https://github.com/proteneer/timemachine/blob/master/tests/test_nblist.py#L41-L59), so the tests never actually checked that PBCs were handled correctly in the GPU kernel.
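
A small CPU sketch of why pre-imaged input hides this class of bug. Everything here is an assumption for illustration: the box is orthorhombic, `good_bounds` and `bad_bounds` are hypothetical stand-ins (not the kernel or the test code), and "correct" is taken to mean wrapping coordinates into the box before taking the min and max.

```python
import math

def image(coords, box):
    """Wrap each coordinate into [0, L) for an orthorhombic box with side lengths `box`."""
    return [[x - L * math.floor(x / L) for x, L in zip(atom, box)] for atom in coords]

def bounds(coords):
    """Center and half-extent of the axis-aligned box enclosing `coords`."""
    lo = [min(axis) for axis in zip(*coords)]
    hi = [max(axis) for axis in zip(*coords)]
    center = [(a + b) / 2 for a, b in zip(lo, hi)]
    extent = [(b - a) / 2 for a, b in zip(lo, hi)]
    return center, extent

def good_bounds(coords, box):
    # Stand-in for a PBC-aware implementation: image first, then bound.
    return bounds(image(coords, box))

def bad_bounds(coords, box):
    # Stand-in for a buggy implementation that ignores the box entirely.
    return bounds(coords)

box = [4.0, 4.0, 4.0]
raw = [[4.5, 0.5, 1.0], [5.5, 1.5, 3.0]]  # x lies one period outside the box

# Unimaged input exposes the bug: the two implementations disagree.
exposes_bug = good_bounds(raw, box) != bad_bounds(raw, box)

# Pre-imaged input makes both agree, so a test fed imaged coordinates cannot tell them apart.
pre_imaged = image(raw, box)
hides_bug = good_bounds(pre_imaged, box) == bad_bounds(pre_imaged, box)
```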

Benchmarks

A10 Cuda Arch 86

Ranges from a <1% speedup to a 3% speedup. The gain is most significant where there are more neighborlists, i.e. the non-apo cases, where all-pairs plus ixn groups result in more neighborlist evaluations.

Master

dhfr-apo: N=23558 speed: 694.34ns/day dt: 2.5fs (ran 100000 steps in 31.11s)
dhfr-apo-barostat-interval-25: N=23558 speed: 622.48ns/day dt: 2.5fs (ran 100000 steps in 34.70s)
building a protein system with 1758 protein atoms and 7047 water atoms
hif2a-apo: N=8805 speed: 1177.04ns/day dt: 2.5fs (ran 100000 steps in 18.35s)
hif2a-apo-barostat-interval-25: N=8805 speed: 1005.22ns/day dt: 2.5fs (ran 100000 steps in 21.49s)
hif2a-rbfe: N=8840 speed: 901.40ns/day dt: 2.5fs (ran 100000 steps in 23.97s)
hif2a-rbfe-local: N=8840 speed: 1298.07ns/day dt: 2.5fs (ran 100000 steps in 16.64s)
solvent-apo: N=6282 speed: 1459.73ns/day dt: 2.5fs (ran 100000 steps in 14.80s)
solvent-apo-barostat-interval-25: N=6282 speed: 1286.07ns/day dt: 2.5fs (ran 100000 steps in 16.80s)
solvent-rbfe: N=6317 speed: 1156.37ns/day dt: 2.5fs (ran 100000 steps in 18.68s)
solvent-rbfe-local: N=6317 speed: 1413.71ns/day dt: 2.5fs (ran 100000 steps in 15.28s)

Shuffle syncing across threads

dhfr-apo: N=23558 speed: 698.51ns/day dt: 2.5fs (ran 100000 steps in 30.93s)
dhfr-apo-barostat-interval-25: N=23558 speed: 626.74ns/day dt: 2.5fs (ran 100000 steps in 34.47s)
building a protein system with 1758 protein atoms and 7047 water atoms
hif2a-apo: N=8805 speed: 1198.14ns/day dt: 2.5fs (ran 100000 steps in 18.03s)
hif2a-apo-barostat-interval-25: N=8805 speed: 1026.80ns/day dt: 2.5fs (ran 100000 steps in 21.04s)
hif2a-rbfe: N=8840 speed: 918.68ns/day dt: 2.5fs (ran 100000 steps in 23.52s)
hif2a-rbfe-local: N=8840 speed: 1330.03ns/day dt: 2.5fs (ran 100000 steps in 16.24s)
solvent-apo: N=6282 speed: 1465.22ns/day dt: 2.5fs (ran 100000 steps in 14.74s)
solvent-apo-barostat-interval-25: N=6282 speed: 1289.07ns/day dt: 2.5fs (ran 100000 steps in 16.76s)
solvent-rbfe: N=6317 speed: 1190.88ns/day dt: 2.5fs (ran 100000 steps in 18.14s)
solvent-rbfe-local: N=6317 speed: 1440.83ns/day dt: 2.5fs (ran 100000 steps in 15.00s)
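
The per-system speedups can be recomputed from the two listings above. The ns/day values below are copied from those listings; the dictionary and variable names are only for this sketch.

```python
# ns/day on master and with shuffle syncing, copied from the benchmark output above.
master = {
    "dhfr-apo": 694.34,
    "dhfr-apo-barostat-interval-25": 622.48,
    "hif2a-apo": 1177.04,
    "hif2a-apo-barostat-interval-25": 1005.22,
    "hif2a-rbfe": 901.40,
    "hif2a-rbfe-local": 1298.07,
    "solvent-apo": 1459.73,
    "solvent-apo-barostat-interval-25": 1286.07,
    "solvent-rbfe": 1156.37,
    "solvent-rbfe-local": 1413.71,
}
shuffle = {
    "dhfr-apo": 698.51,
    "dhfr-apo-barostat-interval-25": 626.74,
    "hif2a-apo": 1198.14,
    "hif2a-apo-barostat-interval-25": 1026.80,
    "hif2a-rbfe": 918.68,
    "hif2a-rbfe-local": 1330.03,
    "solvent-apo": 1465.22,
    "solvent-apo-barostat-interval-25": 1289.07,
    "solvent-rbfe": 1190.88,
    "solvent-rbfe-local": 1440.83,
}

# Relative throughput gain in percent for each system.
speedup_pct = {name: 100.0 * (shuffle[name] / master[name] - 1.0) for name in master}

# Print from smallest to largest gain.
for name, pct in sorted(speedup_pct.items(), key=lambda kv: kv[1]):
    print(f"{name:35s} {pct:5.2f}%")
```

The smallest gain is on solvent-apo-barostat-interval-25 and the largest on solvent-rbfe, consistent with the stated range.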

Kernel Timings

Master

CUDA Kernel Statistics:

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     63.4    1,428,305,504      4,000  357,076.4  356,785.5   304,321   382,817      9,449.2  void k_nonbonded_unified<float, (int)256, (bool)0, (bool)1, (bool)0>(int, int, const unsigned int *…
      9.1      204,626,377        791  258,693.3  258,561.0   221,601   282,913      8,708.7  void k_find_blocks_with_ixns<float, (bool)1>(int, int, int, const unsigned int *, const unsigned in…
      5.8      130,363,148      4,000   32,590.8   32,896.0    27,232    35,040      1,105.7  void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, &curand_normal_scaled2_double…
      3.2       72,540,683      4,160   17,437.7   17,984.0     8,384    27,744      3,208.7  void k_nonbonded_pair_list<float, (bool)1>(int, const double *, const double *, const double *, con…
      2.9       66,110,936      4,160   15,892.1   15,680.0     4,480    29,760      2,880.2  void k_harmonic_angle<float, (int)3>(int, const double *, const double *, const int *, unsigned lon…
      2.7       61,731,830      4,160   14,839.4   15,104.0     5,248    30,592      2,392.9  void k_periodic_torsion<float, (int)3>(int, const double *, const double *, const int *, unsigned l…
      2.6       57,621,107        160  360,131.9  360,881.5   309,473   383,297      9,077.0  void k_nonbonded_unified<float, (int)256, (bool)1, (bool)0, (bool)0>(int, int, const unsigned int *…
      2.2       50,350,006      4,160   12,103.4   12,032.5     3,712    24,928      3,194.0  void k_harmonic_bond<float>(int, const double *, const double *, const int *, unsigned long long *,…
      1.8       40,191,893      4,160    9,661.5    9,471.0     7,520    19,680        767.8  void k_gather_coords_and_params<double>(int, const unsigned int *, const T1 *, const T1 *, T1 *, T1…
      1.8       39,903,985      4,000    9,976.0    9,952.0     8,480    14,240        281.3  void update_forward_baoab<double>(int, int, T1, const unsigned int *, const T1 *, const T1 *, const…
      1.6       36,390,209      4,118    8,836.9    7,712.0     4,512    21,408      2,667.1  void k_check_rebuild_coords_and_box_gather<float>(int, const unsigned int *, const double *, const …
      1.0       21,906,566      4,000    5,476.6    5,472.0     4,512    16,160        446.5  void k_scatter_accum<unsigned long long>(int, const unsigned int *, const T1 *, T1 *)               
      0.9       19,221,938        791   24,300.8   24,192.0    20,480    32,640      1,012.9  void k_find_block_bounds<float>(int, int, int, const unsigned int *, const double *, const double *…

Changes

CUDA Kernel Statistics:

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     63.6    1,439,757,838      4,000  359,939.5  360,097.5   306,049   381,089      9,011.0  void k_nonbonded_unified<float, (int)256, (bool)0, (bool)1, (bool)0>(int, int, const unsigned int *…
      9.2      209,232,553        791  264,516.5  265,153.0   220,289   288,673      8,770.9  void k_find_blocks_with_ixns<float, (bool)1>(int, int, int, const unsigned int *, const unsigned in…
      5.8      131,297,225      4,000   32,824.3   33,121.0    27,328    35,040      1,083.6  void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, &curand_normal_scaled2_double…
      3.3       74,365,234      4,160   17,876.3   18,176.0     8,448    31,040      3,075.1  void k_nonbonded_pair_list<float, (bool)1>(int, const double *, const double *, const double *, con…
      2.9       66,468,324      4,160   15,978.0   15,744.0     4,512    29,344      2,778.2  void k_harmonic_angle<float, (int)3>(int, const double *, const double *, const int *, unsigned lon…
      2.8       62,365,516      4,160   14,991.7   15,296.0     5,248    27,840      2,368.7  void k_periodic_torsion<float, (int)3>(int, const double *, const double *, const int *, unsigned l…
      2.6       58,029,618        160  362,685.1  363,505.0   310,817   381,217     10,392.3  void k_nonbonded_unified<float, (int)256, (bool)1, (bool)0, (bool)0>(int, int, const unsigned int *…
      2.3       51,153,507      4,160   12,296.5   12,256.0     3,904    26,113      3,271.7  void k_harmonic_bond<float>(int, const double *, const double *, const int *, unsigned long long *,…
      1.8       40,345,491      4,160    9,698.4    9,536.0     7,584    18,976        704.3  void k_gather_coords_and_params<double>(int, const unsigned int *, const T1 *, const T1 *, T1 *, T1…
      1.8       40,202,137      4,000   10,050.5   10,047.0     8,576    11,104        250.7  void update_forward_baoab<double>(int, int, T1, const unsigned int *, const T1 *, const T1 *, const…
      1.6       37,210,946      4,118    9,036.2    7,872.0     4,543    22,816      2,808.5  void k_check_rebuild_coords_and_box_gather<float>(int, const unsigned int *, const double *, const …
      1.0       21,974,240      4,000    5,493.6    5,504.0     4,512    14,976        339.6  void k_scatter_accum<unsigned long long>(int, const unsigned int *, const T1 *, T1 *)               
      0.5       12,369,055        791   15,637.2   15,616.0    12,576    23,840        833.1  k_compact_trim_atoms(int, int, unsigned int *, unsigned int *, int *, unsigned int *)               
      0.3        7,454,094      1,120    6,655.4    5,952.0     3,168    17,440      3,561.6  void k_accumulate_energy<(unsigned int)512>(int, const __int128 *, __int128 *)                      
      0.3        6,211,641        791    7,852.9    7,808.0     6,688    16,640        519.7  void k_find_block_bounds<float>(int, int, const unsigned int *, const double *, const double *, T1 …
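
As a quick check of the kernel-level speedup, the k_find_block_bounds rows of the two tables above can be compared directly. The numbers are copied from those rows; the variable names are only for this sketch.

```python
# k_find_block_bounds timings (ns) from the Master and Changes tables above.
master = {"avg": 24300.8, "med": 24192.0, "total": 19221938}
changes = {"avg": 7852.9, "med": 7808.0, "total": 6211641}

# Speedup factor for each statistic (master time divided by new time).
ratios = {key: master[key] / changes[key] for key in master}

# Average time saved per kernel launch, in nanoseconds.
saved_ns_per_call = master["avg"] - changes["avg"]

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
```

Both tables report 791 instances, so the average and total ratios agree.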

* There is something missing in the tests
@badisa badisa changed the title from "Speed up k_find_bounding_blocks" to "Speed up k_find_block_bounds" Jul 29, 2023
block_size = 32
def reference_block_bounds(coords: NDArray, box: NDArray, block_size: int) -> Tuple[NDArray, NDArray]:
    # Make a copy to avoid modifying the coordinates that are later used by the Neighborlist
    coords = coords.copy()
Collaborator Author

Without this copy, the reference function can re-center all of the coordinates in place, so the kernel is never tested on coordinates that actually require PBC handling. This is an issue on master, but thankfully the kernel there was implemented correctly. When a kernel handled PBCs incorrectly, the tests would still pass. With this copy, a bad implementation correctly fails.

Owner

Great catch!
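
The aliasing problem discussed in this thread can be reproduced without the neighborlist. This is a minimal sketch under stated assumptions: nested Python lists and `copy.deepcopy` stand in for the NumPy arrays and `coords.copy()` used in the real test, the box is orthorhombic, and both function names are hypothetical.

```python
import copy
import math

def reference_bounds_in_place(coords, box):
    """Hypothetical reference that wraps `coords` into the box IN PLACE, then bounds them."""
    for atom in coords:
        for d, L in enumerate(box):
            atom[d] -= L * math.floor(atom[d] / L)
    lo = [min(axis) for axis in zip(*coords)]
    hi = [max(axis) for axis in zip(*coords)]
    return lo, hi

def reference_bounds_copy(coords, box):
    """Same computation on a private copy, so the caller's data is untouched."""
    return reference_bounds_in_place(copy.deepcopy(coords), box)

box = [4.0, 4.0, 4.0]
original = [[4.5, 0.5, 1.0], [5.5, 1.5, 3.0]]

# Without the copy: the caller's coordinates are silently imaged.
aliased = copy.deepcopy(original)
reference_bounds_in_place(aliased, box)

# With the copy: the caller's coordinates stay unimaged, so code tested
# with them afterwards still sees out-of-box positions.
safe = copy.deepcopy(original)
reference_bounds_copy(safe, box)
```

After these calls, `aliased` holds imaged coordinates while `safe` still equals `original`. Only the second case leaves out-of-box input for whatever is tested next.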

@badisa badisa marked this pull request as ready for review July 31, 2023 21:37
@badisa badisa merged commit c35a5f7 into master Aug 3, 2023
@badisa badisa deleted the perf/speed-up-find-block-bounds branch August 3, 2023 14:13