
Vectorize/Parallelize the clustering algorithm for the fit results #44

Open
RodenLuo opened this issue Dec 3, 2024 · 19 comments

@RodenLuo
Collaborator

RodenLuo commented Dec 3, 2024

Hi Tom @tomgoddard,

DiffFit has been calling the same clustering algorithm as the fitmap command in ChimeraX. The algorithm relies on the Binned_Transforms class and its close_transforms and add_transform methods from chimerax.geometry.bins. This step has proved to be a rate-limiting step in quite a few cases, and I would like to speed it up.

Before I knew the built-in chimerax.geometry.bins, I once implemented an algorithm to approximately solve it:

  1. Cluster the shifts (x, y, z).
  2. For each cluster from step 1, use the quaternions to transform two orthogonal unit vectors, giving a 6-dimensional vector (v1_x, v1_y, v1_z, v2_x, v2_y, v2_z).
  3. Within each cluster from step 1, cluster the 6-dimensional vectors.

It was buggy and generated more clusters than chimerax.geometry.bins: often two clusters were actually within the threshold and should have been merged into one. (I unfortunately could not figure out why.)
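That two-stage idea can be sketched with a minimal, self-contained toy (the helper names `greedy_cluster` and `cluster_fits` are hypothetical, and the simple leader clustering below stands in for whatever clustering was actually used):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def greedy_cluster(points, tol):
    """Leader clustering: assign each point to the first existing cluster
    center within tol, otherwise start a new cluster."""
    centers, labels = [], []
    for p in points:
        for ci, c in enumerate(centers):
            if np.linalg.norm(p - c) <= tol:
                labels.append(ci)
                break
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return np.asarray(labels)

def cluster_fits(shifts, quats, shift_tol, rot_tol):
    # Step 1: cluster the (x, y, z) shifts.
    shift_labels = greedy_cluster(shifts, shift_tol)
    # Step 2: embed each rotation as the images of two orthogonal unit
    # vectors, giving one 6-D vector per fit.
    rot = R.from_quat(quats)                      # scalar-last (x, y, z, w)
    v6 = np.hstack([rot.apply([1.0, 0.0, 0.0]),
                    rot.apply([0.0, 1.0, 0.0])])
    # Step 3: cluster the 6-D vectors within each shift cluster.
    final = np.zeros(len(shifts), dtype=int)
    next_id = 0
    for s in np.unique(shift_labels):
        idx = np.where(shift_labels == s)[0]
        sub = greedy_cluster(v6[idx], rot_tol)
        final[idx] = sub + next_id
        next_id += sub.max() + 1
    return final
```

One property of leader-style clustering worth noting: the result depends on insertion order and on where cluster centers happen to land, which can split groups that a pairwise threshold test would merge.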

I wonder if you have any ideas on how to speed up the clustering. Ideally, the required computation would be vectorized so that it can run on PyTorch, or at least be parallelizable through, for example, OpenMP.

Many thanks!

@tomgoddard

Are you sure the binning code is the bottleneck? The ChimeraX code is simple and not intended for highest performance. It simply hashes the translation and rotation angle in bins of the requested size (translation step size and angle step size). That should be very fast. The way the ChimeraX fitmap search command uses it is that each time a new fit is computed it checks if it is near a previous fit. If it is near a previous one it does not add it to the binning. This way the bins don't fill up with fits. If you let the bins fill up with fits and then start doing lookups of all close transforms it could get slow.

@RodenLuo
Collaborator Author

RodenLuo commented Dec 4, 2024

Thanks, Tom! DiffFit first optimizes all fits in parallel. When it reaches clustering, it loops through all filtered fits (e.g., by levelInside) and uses the same way as the fitmap search command, i.e., only b.add_transform(ptf) when b.close_transforms(ptf) == 0.

I list a quick benchmark below (times in seconds). The second column records the time to convert an array of shifts and quaternions to chimerax.geometry.Place objects; it is not vectorized yet, as it needs Place. It does look like bin clustering takes much longer as the final number of clusters grows.

| # Fits | Convert to Matrix/Place (s) | Bin clustering (s) | # Clusters |
|-------:|----------------------------:|-------------------:|-----------:|
| 250,000 | 2.1 | 72.2 | 5124 |
| 210,648 | 1.7 | 45.8 | 2882 |
| 36,267 | 0.3 | 8.4 | 1435 |
| 20,839 | 0.2 | 2.9 | 382 |
| 15,475 | 0.1 | 2.2 | 137 |

The clustering is certainly not the longest computation. Optimizing 500 shifts x 500 quaternions = 250,000 fits takes 4.5 minutes (this can be improved further, as it currently saves all parameters every 10 iterations). But I feel the clustering should finish in under a second.

@tomgoddard

I agree that is slow, about 3500 fits/second for bin clustering (= 250,000 / 72.2). Numpy is slow at creating and operating on small arrays such as the 3x4 transform matrix used by Place. If you attach your Python benchmark script, I might try it. Of course, you can profile the Python code and find out where the time is going. I agree this could run 100 to 1000 times faster if implemented for speed. But the first step is to figure out where all the time is going in the current unoptimized Python implementation.

@RodenLuo
Collaborator Author

RodenLuo commented Dec 4, 2024

The attached fit_res_clustering_benchmark.zip should be roughly the minimal code and data needed to reproduce the first line in the table. Put the contents of the zip file in the working directory (usually your Desktop), then run runscript fit_res_clustering_benchmark.py in the ChimeraX command line. I get the following log.

Clustering 250000 fits
Convert to matrix time: 0:00:02.245692
ChimeraX bin clustering: 0:01:16.228472

You may uncomment Line 17 in fit_res_clustering_benchmark.py to "mimic" (as the real run above involves some filtering) the last line in the benchmark table. I get the following log.

Clustering 15475 fits
Convert to matrix time: 0:00:00.131463
ChimeraX bin clustering: 0:00:02.478661

Thanks!

@RodenLuo
Collaborator Author

RodenLuo commented Dec 4, 2024

The script in text form:

```python
from scipy.spatial.transform import Rotation as R
from chimerax.geometry import bins
from chimerax.geometry import Place
import numpy as np
from datetime import datetime
from math import pi


fit_res = np.load("fit_res_filtered.npz")
fit_res_filtered = fit_res['fit_res_filtered']
mol_center = fit_res['mol_center']

mol_shift = fit_res_filtered[:, :3]   # (x, y, z) translations
mol_q = fit_res_filtered[:, 3:7]      # quaternions, scipy scalar-last (x, y, z, w)

Total_fits = len(mol_shift)
# Total_fits = 15475  # to mimic other records in the reported table

print(f"Clustering {Total_fits} fits")
timer_start = datetime.now()

# Convert each (shift, quaternion) pair to a 3x4 chimerax.geometry.Place.
T = []
for i in range(Total_fits):
    shift = mol_shift[i]
    quat = mol_q[i]
    R_matrix = R.from_quat(quat).as_matrix()

    T_matrix = np.zeros([3, 4])
    T_matrix[:, :3] = R_matrix
    T_matrix[:, 3] = shift

    transformation = Place(matrix=T_matrix)
    T.append(transformation)

print(f"Convert to matrix time: {datetime.now() - timer_start}")
timer_start = datetime.now()

angle_tolerance = 6.0   # degrees
shift_tolerance = 3.0   # Angstroms

b = bins.Binned_Transforms(angle_tolerance * pi / 180, shift_tolerance, mol_center)
mol_transform_label = []
unique_id = 0
T_ID_dict = {}
for i in range(Total_fits):
    ptf = T[i]
    close = b.close_transforms(ptf)
    if len(close) == 0:
        # No existing fit within tolerance: start a new cluster.
        b.add_transform(ptf)
        mol_transform_label.append(unique_id)
        T_ID_dict[id(ptf)] = unique_id
        unique_id = unique_id + 1
    else:
        # Label with the cluster of the first close transform found.
        mol_transform_label.append(T_ID_dict[id(close[0])])
        T_ID_dict[id(ptf)] = T_ID_dict[id(close[0])]

print(f"ChimeraX bin clustering: {datetime.now() - timer_start}")
```

@tomgoddard

On my Mac Studio M2 Ultra your script takes 27 seconds to bin the 250,000 fits, where your machine takes 76 seconds. There are only 5124 clusters, so that averages about 50 fits per cluster with your 6 degree, 3 Angstrom cluster size.

There is an additional parameter to the Binned_Transforms() constructor called bfactor, default 10, which sets how much larger the hashed bins are than the angle and translation tolerances. Large bins of 60 degrees and 30 Angstroms put a lot of fits in one bin, and the code then searches everything in a bin for close transforms, which is slow. If I use bfactor = 1 or 2 in the constructor, the total time drops to 12 seconds, a factor of 2 speedup.

You might expect it to be faster still, but a couple of factors slow it down. First, many of your fits are nearly identical (within 1 degree and 1 Angstrom), so the smaller bins still hold a ton of fits. Second, with smaller bins the code may look at up to 16 times more neighbor bins (the bins are 4D: rotation angle plus the x, y, z translation axes), and those hash lookups start to take time.

Another slow aspect of the code is that it finds all close transforms instead of stopping when it has found at least one. I tested how fast it might be if it just found at least one transform in the same bin, and the total time was 4 seconds. So it is probably possible to make small improvements to the Python code and make it 7x faster for your use case. That is still very slow, but not too surprising given that it is Python and not vectorized. I'd say an optimized C++ version would be about 300x faster than the current Python. But as usual it is a tradeoff between slow and simple with little time developing the code, or fast and complex and a lot of time developing the code.
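For intuition about the scheme described above, here is a toy model (not the real chimerax.geometry.bins code, and `ToyBinnedTransforms` is a made-up name) of hashing each transform into a 4-D bin keyed by rotation angle and translation, with bins bfactor times larger than the tolerances:

```python
from collections import defaultdict
from math import floor, pi

class ToyBinnedTransforms:
    """Toy illustration of 4-D transform hashing: each transform is reduced
    to a point (rotation angle, tx, ty, tz) and hashed into bins whose size
    is bfactor times the clustering tolerances."""
    def __init__(self, angle_tol, shift_tol, bfactor=10):
        self.abin = angle_tol * bfactor   # angular bin width
        self.sbin = shift_tol * bfactor   # translational bin width
        self.bins = defaultdict(list)

    def key(self, angle, shift):
        # 4-D integer bin coordinates.
        return (floor(angle / self.abin),
                floor(shift[0] / self.sbin),
                floor(shift[1] / self.sbin),
                floor(shift[2] / self.sbin))

    def add(self, angle, shift):
        self.bins[self.key(angle, shift)].append((angle, shift))

    def candidates(self, angle, shift):
        # A real implementation must also scan neighboring bins, since a
        # close transform can sit just across a bin boundary.
        return self.bins.get(self.key(angle, shift), [])
```

Scanning only one bin can miss close transforms just across a bin boundary, which is why the real code also visits neighboring bins, and why a larger bfactor trades fewer bin lookups for more candidates per bin.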

@RodenLuo
Collaborator Author

RodenLuo commented Dec 6, 2024

> I tested how fast it might be if it just found at least one transform in the same bin and total time was 4 seconds.

How did you achieve this?

Many thanks! bfactor=2 consistently gives higher performance as the number of fits varies from 1,000 to 250,000, although with 1,000 fits the speedup is marginal (0.084 s -> 0.075 s). With more than 35,000 fits, the speedup is a factor of 2. DiffFit usually has at least 10 x 100 = 1000 fits. I experimented with bfactor = 1, 2, 3, and 4 and found 2 to be the best, so I am adding bfactor=2 to the code base now.

Logic-wise, I use a dictionary keyed by id(ptf) to record every fit's cluster ID. A new cluster ID is generated when a fit is not close to any other and is therefore added to the bins. When a fit is not unique, I use the id of the first element in the list returned by close_transforms to fetch the cluster ID. This has the caveat that the returned cluster might not be the closest one to the fit being tested. I then pick from each cluster the fit with the highest occupied density to represent that cluster, which means two representatives can fall within the shift and angle tolerances and create some redundancy for downstream analyses. I guess this is a common issue in clustering problems. It does not matter much in DiffFit's case, as the redundancy can be removed fairly easily by visualization and user interaction or by a downstream testing (or "assembly") algorithm.

I thus extended Binned_Transforms to DiffFit_Binned_Transforms (attached below) to return only the first hit and avoid the len() call. But I only get a speedup from 36 to 29 seconds for 250,000 fits with bfactor=2 (from 76 to 67 seconds with bfactor=10).

```python
class DiffFit_Binned_Transforms(Binned_Transforms):
    def __init__(self, angle, translation, center=(0, 0, 0), bfactor=2):
        super().__init__(angle, translation, center, bfactor)

    def closest_transforms(self, tf):
        # Bin coordinates of tf: (rotation angle, x, y, z of the moved center).
        _, x, y, z = c = self.bin_point(tf)
        clist = self.bins.close_objects(c, self.spacing)
        if len(clist) == 0:
            return None

        itf = tf.inverse()
        d2max = self.translation * self.translation
        for ctf in clist:
            cx, cy, cz = ctf * self.center
            dx, dy, dz = x - cx, y - cy, z - cz
            d2 = dx * dx + dy * dy + dz * dz
            if d2 <= d2max:
                dtf = ctf * itf
                a = dtf.rotation_angle()
                if a < self.angle:
                    # Return the first transform within both tolerances.
                    return ctf

        return None
```

@tomgoddard

Hi Roden,

Your DiffFit_Binned_Transforms doesn't speed things up much because all the time is spent in self.bins.close_objects(c, self.spacing), and you didn't change that code. The 4-second time I mentioned was not a complete solution: it merely checked whether the bin containing the transform held another transform, using bfactor = 1 so that the bin size equaled the 6 degree and 3 Angstrom tolerances. That quickly determines that a nearby transform exists; if the bin has no transform, it falls back to the slow method. I've attached the modifications I tried to your benchmark script.

You are right that you are not going to get clusters defined by the nearest transforms with the existing code. I wrote it for ChimeraX to maximize simplicity, without sacrificing speed by insisting on a conceptually nice definition of clusters. These trade-offs are typical of clustering methods. I hope you understand that ChimeraX is full of "good enough", simple-to-code solutions. You can always build better solutions with more complex code, but we try to provide the best value for researchers, and that means not spending 5x more coding, debugging, and maintenance time to produce a somewhat nicer solution.

Tom

fit_res_clustering_benchmark_mod.zip

@RodenLuo
Collaborator Author

RodenLuo commented Dec 9, 2024

Hi Tom,

Many thanks!! I went through the two related classes a bit and realized your speed-up method could be a complete solution, since the first Place object in a newly added bin is already stored (b.bins.bins[cbin][0][1]). The attached solution (fit_res_clustering_benchmark_mod_complete.zip) gives me a speedup from 30 to 12 seconds. The only issue is that the final cluster count dropped from 5124 to 4580. I checked at the final practical level for a few cases, and all correct fits can still be identified, so I didn't try to figure out why and believed this should be a sound solution.

I totally understand that for such a giant software system, many factors need to be considered and trade-offs need to be made. ChimeraX, in my humble opinion, is a great tool, and I have learned a lot from it. Once again, very much appreciated, and many thanks!


With the current speed, I wouldn't say the clustering is so rate-limiting that it needs speeding up. Projecting into the future, I might come back if the other parts get a significant speedup later. I do have a question, then. If I understand the code correctly, only one float is used to encode the rotation: a = tf.rotation_angle(). I don't understand how this can be sound unless it is a very rough estimate, given that the rotation comes from either 3 (Euler angles) or 4 (quaternion) floats. I had considered using 6 floats (two orthogonal unit vectors transformed by the quaternion) to encode the rotation for clustering purposes.
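For reference, the single float in question is the overall rotation angle, which for a unit quaternion (x, y, z, w) is theta = 2*arccos(|w|). A small sketch (not ChimeraX's implementation; `rotation_angle` here is a standalone helper):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def rotation_angle(quat_xyzw):
    """Rotation angle in radians of a unit quaternion in scipy's
    scalar-last (x, y, z, w) order, folded into [0, pi]."""
    w = min(1.0, abs(quat_xyzw[3]))  # clamp against floating-point drift
    return 2.0 * np.arccos(w)
```

Note that this one number cannot distinguish rotations about different axes, which is exactly the concern raised here.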

Very best,
Roden

@RodenLuo RodenLuo closed this as completed Dec 9, 2024
@tomgoddard

Hi Roden,

I don't think your code is correct; you should get exactly the same 5124 clusters. I suspect the problem is that finding a transform already in a bin does not guarantee that it is near the query transform, since, as you noted, only the rotation angle is binned, not the rotation axis. So your current code decides not to make a new cluster based on the inadequate test that there is already some transform in its bin. It needs the further test that some transform in that bin is actually close to the query transform.
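A quick numeric illustration of that point (a sketch using scipy, not the benchmark data): two rotations with identical rotation angle, and hence the same angle bin, can still be far apart:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

q = R.from_euler('x', 90, degrees=True)   # 90 degrees about x
r = R.from_euler('y', 90, degrees=True)   # 90 degrees about y

# Both have the same rotation angle (the binned quantity) ...
assert abs(q.magnitude() - r.magnitude()) < 1e-12

# ... yet the relative rotation between them is 120 degrees, far outside
# a 6-degree tolerance, so sharing a bin does not imply closeness.
rel_deg = np.degrees((q * r.inv()).magnitude())
```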

It is of course possible to bin using the rotation axis and angle. But my code only uses the angle and then checks the elements in the bin to see if they are actually close (done in the Binned_Transforms.close_transforms() method). Why did I do it this way? Because it is faster running and a bit simpler.

Tom

RodenLuo added a commit that referenced this issue Dec 10, 2024
@RodenLuo
Collaborator Author

Aha, I see now. Even with bfactor=1, the transforms in the bins are still only "close" transforms rather than "in cluster" transforms; the Binned_Transforms.close_transforms() method is what returns the "in cluster" transforms. This is a really smart and concise method. I have reverted the code base to my previous solution and will think later this week about how to vectorize the clustering.

Many thanks!

@RodenLuo RodenLuo reopened this Dec 10, 2024
@tomgoddard

The transforms in a bin are not necessarily even remotely close. They have the same translation and rotation angle, but maybe the rotation angle is 90 degrees and the axes of two rotations are 90 degrees apart so the rotations are completely different. The binning is for fast lookup. The bins are not intended to be close transforms. They are just used to greatly reduce the number of transforms that need to be compared. Still the optimization where you start by looking in the bin that the query transform would be in and stop as soon as you find one transform within the threshold distances will probably significantly accelerate the search and is pretty easy to code. In optimizing you usually try such easy improvements first and only move on to harder optimizations if those don't give enough speed-up.

RodenLuo added a commit that referenced this issue Dec 17, 2024
no mol_center related calculations
promoted d2max to be a class attribute
31-sec to 23-sec speedup for the case in #44
@RodenLuo
Collaborator Author

Hi Tom, Thanks for the guidance! The attached DiffFit_bins.zip includes my progress towards this issue. It has the following files.

  - fit_res_clustering_benchmark_cProfile.py # the previously working code with cProfile
  - fit_res_clustering_benchmark_cProfile.log # the profile result; are_coordinates_close stands out
  - fit_res_clustering_benchmark.py and DiffFit_bins.py # my current solution
  - fit_res_clustering_opt.py # my incorrect optimization attempt
  - fit_res_clustering_benchmark_Birch.py # my vectorization attempt

In my initial post, there was a bug: somewhere along the way I started centering all the molecules before fitting, so mol_center should always be the origin (0, 0, 0). Fixing this gives 5121 clusters. I then created DiffFit_bins.py, removed mol_center from all calculations, and promoted d2max to a class attribute. This gave a speedup from 31 to 23 seconds. I consider this my current best (and probably final, for the near future) solution to this issue.


I tried to implement are_coordinates_close with numpy arrays, but it took longer, probably because of your early stopping and the array-construction overhead.

I also tried to optimize in the direction you suggested. In fit_res_clustering_opt.py, I implemented the following pseudo-code.

```
bfactor = 1

for each query transform ptf:
    find cbin (the bin that ptf falls in)
    if cbin in bins:
        for each transform o in bins[cbin]:
            check if ptf is close to o
```

It gave a speedup from 23 to 19 seconds, but produced 6473 clusters instead of 5121. I was not sure where the problem was.

I searched for C++ integration resources in ChimeraX but found little. From writing_bundles, the sample code link has expired, and the two GitHub repos don't contain the C++ code or guidelines. The C++ Libraries page seems unmaintained. DiffFit now depends on chimerax-qscore's _kmeans_cpp, but I could not compile that project. I also noticed "Running C++ computations in a separate thread" in ChimeraX Recipes, which I could not compile on my end either.

I also gave scikit-learn's Birch a try. The idea is to rotate a unit up vector and a unit right vector by the quaternion; together with the shift, this yields 3 sets of coordinates to be clustered. It finishes in about 20 seconds, and the 3 sets could be clustered in parallel for more speedup. However, using ChimeraX's clustering result as the reference, I find it hard to set threshold values that produce a similar number of unique clusters.

Clustering 250000 fits
q2_unit_coord: 0:00:03.817841
mol_shift cluster: 0:00:06.634961
r_up cluster: 0:00:04.555849
r_right cluster: 0:00:04.513924
Total unique: 15554
Birch clustering: 0:00:19.691360
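The Birch attempt can be sketched as follows (`birch_fit_clusters` is a hypothetical name, and the thresholds are illustrative, not the values behind the log above):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R
from sklearn.cluster import Birch

def birch_fit_clusters(shifts, quats, shift_threshold, rot_threshold):
    """Cluster the shifts and two rotated unit vectors independently with
    Birch, then intersect the three labelings: fits sharing a final cluster
    must agree in all three spaces."""
    rot = R.from_quat(quats)                 # scalar-last (x, y, z, w)
    r_up = rot.apply([0.0, 0.0, 1.0])        # rotated unit up vector
    r_right = rot.apply([1.0, 0.0, 0.0])     # rotated unit right vector
    # n_clusters=None keeps the raw subcluster labels, one per leaf entry.
    l_shift = Birch(threshold=shift_threshold, n_clusters=None).fit_predict(shifts)
    l_up = Birch(threshold=rot_threshold, n_clusters=None).fit_predict(r_up)
    l_right = Birch(threshold=rot_threshold, n_clusters=None).fit_predict(r_right)
    # Intersect the three labelings into final cluster IDs.
    triples = list(zip(l_shift, l_up, l_right))
    ids = {t: i for i, t in enumerate(dict.fromkeys(triples))}
    return np.array([ids[t] for t in triples])
```

Intersecting three independent labelings tends to over-split, which is consistent with the higher unique count reported in the log.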

I then added some higher-level filtering to expose a user-controllable parameter that caps the number of fits per molecule to be clustered. The default is the top 10,000 fits ranked by occupied density. With this setting, the speed of ChimeraX's bin clustering is fine.

Kindly let me know if you have any comments or suggestions. Thanks again.

@tomgoddard

Your original binned fitting (with default bins 10x bigger than the cluster sizes) takes 32.3 seconds for your 250,000 fits on my Mac laptop. Making the bin size twice the cluster size (bfactor = 2) reduces the time to 12.7 seconds. Following my suggestion of first checking the center bin for a close transform (almost all of the 250,000 transforms have a binned close transform, and this catches them quickly) reduces the time to 6.67 seconds, almost another factor of 2 speedup, and it gives the correct number of clusters, 5124. Code is attached.
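The fast-path pattern described above can be sketched generically (the function name and the dict-of-lists bin layout are illustrative; the attached script operates on ChimeraX's actual Binned_Transforms):

```python
from itertools import product

def find_close_transform(bins, center_key, query, is_close):
    """Scan the query's own bin first; when most queries find a close
    transform there, this skips visiting the 3**4 - 1 = 80 neighboring
    bins (bins are 4-D: rotation angle plus x, y, z translation)."""
    for t in bins.get(center_key, []):
        if is_close(query, t):       # full distance + angle test
            return t
    # Slow path: scan every neighboring bin.
    for offset in product((-1, 0, 1), repeat=len(center_key)):
        if not any(offset):
            continue                 # center bin already scanned
        key = tuple(k + o for k, o in zip(center_key, offset))
        for t in bins.get(key, []):
            if is_close(query, t):
                return t
    return None
```

The crucial difference from checking bin membership alone is that the full closeness test still runs on every candidate, so the fast path never changes the answer, only the order of the search.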

The dead ChimeraX sample code link and the blank C++ Libraries page should be reported with the ChimeraX menu Help / Report a Bug so we can fix those documentation problems. I already reported them, but it would be great if you report such problems in the future to help make ChimeraX better. I don't recommend multiple C++ threads, as it is extremely error prone, and the example on ChimeraX Recipes is from a non-UCSF developer, Tristan, and is many years old.

Your improvement of binning only the best scoring fits seems ok.

fit_clustering.zip

@RodenLuo
Collaborator Author

Thanks, Tom! Unfortunately, I couldn't reproduce the improvement with the attached code on Windows 11. I tried commenting out Line 66 and uncommenting Line 67 in fit_res_clustering_benchmark_mod3.py. With Line 67, the time increases and the number of clusters changes. Did you attach the wrong file?

# Using Line 66 
Clustering 250000 fits
Convert to matrix time: 0:00:01.936995
ChimeraX bin clustering, 5124 clusters: 0:01:12.093816

# Using Line 67 
Clustering 250000 fits
Convert to matrix time: 0:00:01.911899
ChimeraX bin clustering, 5094 clusters: 0:02:15.845920

Sure. I will report things as such using the Help / Report a Bug button.

@RodenLuo
Collaborator Author

On my way back home, I realized bfactor has to be 1. Sorry and thanks!

@tomgoddard

Oops! My optimized version from 3 comments ago did not measure the translation right; that is why it didn't get the expected 5124 clusters. Also, I accidentally changed bfactor and commented out the optimization before sending it to you, because I was retiming the original unoptimized version.

Here is the fixed optimization (mod3). I'm on Mac Studio now and the original bfactor = 10 time on this computer is 27.7 seconds, the bfactor = 2 time is 11.5 seconds, and the optimized version time is 6.7 seconds. And all give 5124 clusters as expected.

fit_cluster2.zip

RodenLuo added a commit that referenced this issue Dec 18, 2024
23-sec to 16-sec speedup
Still need to incorporate into the code base
Still need to understand the inverse operation
RodenLuo added a commit that referenced this issue Dec 18, 2024
RodenLuo added a commit that referenced this issue Dec 19, 2024
@RodenLuo
Collaborator Author

Hi Tom, Thanks again! I made a few changes and applied the solution.

  1. Postpend the inverse transform rather than prepend. I can understand btf = btf * tf.inverse() * tf, so btf * tf.inverse() represents the transform that brings tf to btf. You have tf.inverse() postpended in the ChimeraX code base, but prepended in the solution above. Even though the number of clusters is correct, I cannot comprehend why.
  2. Calculate tf.inverse() before looping through all transforms in the bin. This is arguable, because the distance threshold can trigger an early stop. In practice, precalculating the inverse is slightly faster, probably because most bin transforms pass the distance threshold.
  3. Remove the mol_center related calculations and promote d2max to a class attribute.
  4. Copy your bins.py (as there are minor changes spread here and there) into the DiffFit code base, keeping all the copyright notices.

I might convert the are_coordinates_close function to C++ after I figure out the C++ build pipeline for ChimeraX, as I'm now running profiling on almost the whole PDB/AFDB and even a slight speedup may help a lot.

(Misc: The d2max variable in the ChimeraX code base could be refactored into a class attribute. My previous attempt got it wrong because when the central bin had no close transform, I didn't fall back to checking the nearby bins; I only checked whether the central bin existed at all...)

@tomgoddard

It is a surprising fact that for two rotations Q and R the rotation angle of QR equals the rotation angle of RQ. It would be nice to have an intuitive proof of that fact.
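One short argument: RQ = Q^-1 (QR) Q, so QR and RQ are similar matrices with equal traces, and the rotation angle is determined by the trace via tr(M) = 1 + 2 cos(theta). A quick numeric check:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

Q = R.random(random_state=0)
S = R.random(random_state=1)

# magnitude() returns the rotation angle in radians; the two products
# are similar matrices, so their angles agree.
angle_qs = (Q * S).magnitude()
angle_sq = (S * Q).magnitude()
assert abs(angle_qs - angle_sq) < 1e-9
```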

Worrying about where to put the d2max = d * d or where to put the tf.inverse() call is not sensible optimization unless it makes a significant (e.g. > 2%) difference in speed, and I don't think it does in these cases. I suggest keeping the code as easy to read and understand as possible by avoiding doing tiny optimizations.
