Vectorize/Parallelize the clustering algorithm for the fit results #44
Comments
Are you sure the binning code is the bottleneck? The ChimeraX code is simple and not intended for highest performance. It simply hashes the translation and rotation angle in bins of the requested size (translation step size and angle step size). That should be very fast. The way the ChimeraX fitmap search command uses it is that each time a new fit is computed it checks if it is near a previous fit. If it is near a previous one it does not add it to the binning. This way the bins don't fill up with fits. If you let the bins fill up with fits and then start doing lookups of all close transforms it could get slow.
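Roughly, that incremental pattern looks like this (a minimal sketch, not the actual fitmap code; the tolerances and center point below are placeholders):

```python
from math import pi
from chimerax.geometry import bins

def keep_distinct_fits(transforms, angle_deg=6.0, shift=3.0, center=(0, 0, 0)):
    # Sketch of the fitmap-style incremental use: only add a fit to the binning
    # if no previously kept fit is already within the angle/shift tolerances.
    b = bins.Binned_Transforms(angle_deg * pi / 180, shift, center)
    kept = []
    for tf in transforms:          # tf is a chimerax.geometry.Place
        if not b.close_transforms(tf):
            b.add_transform(tf)    # bins stay sparse: near-duplicates are never added
            kept.append(tf)
    return kept
```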
Thanks, Tom! DiffFit first optimizes all fits in parallel. When it reaches clustering, it loops through all filtered fits. I list a quick benchmark below. (The second column records the time to convert from an array of shifts and quaternions to Place objects.)
The clustering is for sure not the longest computation. Optimizing 500 shifts x 500 quaternions = 250,000 fits takes 4.5 minutes (which can be further improved, as it currently involves saving all parameters every 10 iterations). But I feel the clustering should be done in less than a second.
I agree that is slow, about 3500 fits/second for bin clustering (= 250000 / 72.2). Numpy is slow creating and operating on small arrays such as the 3x4 transform matrix used by Place. If you attach your Python benchmark script I might try it. Of course you can profile Python code and find out where the time is going. I agree this could run 100 to 1000 times faster if implemented for speed. But the first step is to figure out where all the time is going in the current unoptimized Python implementation.
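For that profiling step, Python's standard-library profiler is enough (a sketch; it assumes the benchmark is wrapped in a main() function):

```python
import cProfile
import pstats

cProfile.run("main()", "clustering.prof")        # main() is assumed to run the benchmark
stats = pstats.Stats("clustering.prof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 entries by cumulative time
```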
The attached fit_res_clustering_benchmark.zip should be roughly the minimal code and data to reproduce the first line in the table. Put the contents of the zip file in the working directory (usually your Desktop) and run the script from within ChimeraX.
You may uncomment Line 17 in the script to mimic the other records in the reported table.
Thanks!
The script in text form:

from scipy.spatial.transform import Rotation as R
from chimerax.geometry import bins
from chimerax.geometry import Place
import numpy as np
from datetime import datetime
from math import pi
fit_res = np.load("fit_res_filtered.npz")
fit_res_filtered = fit_res['fit_res_filtered']
mol_center = fit_res['mol_center']
mol_shift = fit_res_filtered[:, :3]
mol_q = fit_res_filtered[:, 3:7]
Total_fits = len(mol_shift)
# Total_fits = 15475 # to mimic other records in the reported table
print(f"Clustering {Total_fits} fits")
timer_start = datetime.now()
T = []
for i in range(Total_fits):
    shift = mol_shift[i]
    quat = mol_q[i]
    # Build a 3x4 transform matrix (rotation | translation) and wrap it in a Place
    R_matrix = R.from_quat(quat).as_matrix()
    T_matrix = np.zeros([3, 4])
    T_matrix[:, :3] = R_matrix
    T_matrix[:, 3] = shift
    transformation = Place(matrix=T_matrix)
    T.append(transformation)
print(f"Convert to matrix time: {datetime.now() - timer_start}")
timer_start = datetime.now()
angle_tolerance = 6.0
shift_tolerance = 3.0
b = bins.Binned_Transforms(angle_tolerance * pi / 180, shift_tolerance, mol_center)
mol_transform_label = []
unique_id = 0
T_ID_dict = {}
for i in range(Total_fits):
    ptf = T[i]
    close = b.close_transforms(ptf)
    if len(close) == 0:
        # No nearby transform yet: this fit starts a new cluster
        b.add_transform(ptf)
        mol_transform_label.append(unique_id)
        T_ID_dict[id(ptf)] = unique_id
        unique_id = unique_id + 1
    else:
        # Assign the fit to the cluster of the first close transform found
        mol_transform_label.append(T_ID_dict[id(close[0])])
        T_ID_dict[id(ptf)] = T_ID_dict[id(close[0])]
print(f"ChimeraX bin clustering: {datetime.now() - timer_start}")
On my Mac Studio M2 Ultra your script takes 27 seconds to bin the 250,000 fits where your machine takes 76 seconds. There are only 5124 clusters, so averaging 50 fits per cluster with your 6 degree, 3 Angstrom cluster size.

There is an additional parameter to the Binned_Transforms() constructor called bfactor, default 10, which is how much larger the hashed bins are than the angle and translation size. The large bins, 60 degrees and 30 Angstroms, are going to put a lot of fits in one bin, and then the code searches all of those in a bin for close ones, which will be slow. If I use bfactor = 1 or 2 in the constructor the total time reduces to 12 seconds, a factor of 2 speed-up. You might think it would be faster, but a couple of factors slow it down. First, you have many fits that are nearly identical (within 1 degree and 1 Angstrom), so the smaller bins still have a ton of fits. And then there is the extra cost that with smaller bins the code may be looking at 16 times more neighbor bins (the bins are 4D: rotation angle + x, y, z translation axes), and those hash lookups start to take time.

Another slow aspect of the code is that it finds all close transforms instead of just finding whether there is at least one. I tested how fast it might be if it just found at least one transform in the same bin, and the total time was 4 seconds. So it is probably possible to make small improvements to the Python code and make it 7x faster for your use case. That is still very slow, but is not too surprising given that it is Python and not vectorized. I'd say an optimized C++ version would be about 300x faster than the current Python. But as usual it is a trade-off between slow and simple with little time developing the code, or fast and complex and a lot of time developing the code.
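For reference, in the benchmark script above that just means passing the extra argument to the constructor (a one-line sketch; bfactor is given as a keyword here, but positional also works):

```python
# bfactor=2 makes the hash bins twice the cluster size instead of the default 10x
b = bins.Binned_Transforms(angle_tolerance * pi / 180, shift_tolerance, mol_center, bfactor=2)
```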
How did you achieve this? Many thanks! Logic-wise, I use a dictionary keyed by id() of the transforms to map each fit to its cluster label (as in the benchmark script above). I thus extend Binned_Transforms:

class DiffFit_Binned_Transforms(Binned_Transforms):
    def __init__(self, angle, translation, center=(0, 0, 0), bfactor=2):
        super().__init__(angle, translation, center, bfactor)

    def closest_transforms(self, tf):
        a, x, y, z = c = self.bin_point(tf)
        clist = self.bins.close_objects(c, self.spacing)
        if len(clist) == 0:
            return None
        itf = tf.inverse()
        d2max = self.translation * self.translation
        for ctf in clist:
            cx, cy, cz = ctf * self.center
            dx, dy, dz = x - cx, y - cy, z - cz
            d2 = dx * dx + dy * dy + dz * dz
            if d2 <= d2max:
                dtf = ctf * itf
                a = dtf.rotation_angle()
                if a < self.angle:
                    return ctf
        return None
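For context, this is roughly how I call it in the clustering loop (same bookkeeping as the benchmark script above; b is a DiffFit_Binned_Transforms instance):

```python
ctf = b.closest_transforms(ptf)
if ctf is None:
    # No close transform found: start a new cluster with this fit
    b.add_transform(ptf)
    mol_transform_label.append(unique_id)
    T_ID_dict[id(ptf)] = unique_id
    unique_id += 1
else:
    # Reuse the cluster label of the close transform that was found
    mol_transform_label.append(T_ID_dict[id(ctf)])
    T_ID_dict[id(ptf)] = T_ID_dict[id(ctf)]
```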
Hi Roden,

Your DiffFit_Binned_Transforms doesn't speed things up much because all the time is spent in self.bins.close_objects(c, self.spacing) and you didn't change that code. The 4 second time I mentioned was not a complete solution; it merely checked whether the bin containing the transform had another transform, and used bfactor = 1 so that the bin size was the 6 degree and 3 Angstrom tolerance. That quickly determines that a nearby transform exists. But if the bin does not have a transform, then we fall back to the slow method. I've attached the modifications to your benchmark script that I tried.

You are right that you are not going to get clusters defined by the nearest transforms with the existing code -- I wrote it for ChimeraX to maximize simplicity without sacrificing speed by insisting on a nice conceptual definition of clusters. These trade-offs are typical of clustering methods. I hope you understand ChimeraX is full of "good enough", simple-to-code solutions. You can always make better solutions with more complex code. But we try to provide the best value for researchers, and that means not spending 5x more coding, debugging, and maintenance time to produce a somewhat nicer solution.

Tom
Hi Tom, Many thanks!! I went through the two related classes a bit and realized your speed-up method could be a complete solution, as you have already stored the first Place object in the newly added bin.

I totally understand that for such a giant software system, there are many factors that need to be considered and trade-offs that need to be made. ChimeraX, in my humble opinion, is a great tool. I have learned a lot from it. Once again, very appreciated, and many thanks!

With the current speed, I wouldn't say the clustering is so rate-limiting that it needs speeding up. Projecting into the future, I might come back if the other parts get a significant speed-up later.

I do have a question, then. If I understand the code correctly, only one float number, the rotation angle, is used to encode the rotation. Is that enough to tell different rotations apart?

Very best,
Hi Roden,

I don't think your code is correct. You should get exactly the same number, 5124 clusters. I suspect the problem is that finding a transform already in a bin does not guarantee that it is near the query transform, since, as you noted, only the rotation angle is binned, not the rotation axis. So your current code is deciding not to make a new cluster based on the inadequate test that there is already a transform in its bin. It needs a further test that some transform in that bin is actually close to the query transform.

It is of course possible to bin using the rotation axis and angle. But my code only uses the angle and then checks the elements in the bin to see if they are actually close (done in the Binned_Transforms.close_transforms() method). Why did I do it this way? Because it is faster running and a bit simpler.

Tom
Aha, I see now. Even though two transforms fall in the same bin, they are not necessarily close. Many thanks!
The transforms in a bin are not necessarily even remotely close. They have the same translation and rotation angle, but maybe the rotation angle is 90 degrees and the axes of the two rotations are 90 degrees apart, so the rotations are completely different. The binning is for fast lookup. The bins are not intended to hold only close transforms; they are just used to greatly reduce the number of transforms that need to be compared.

Still, the optimization where you start by looking in the bin that the query transform would be in, and stop as soon as you find one transform within the threshold distances, will probably significantly accelerate the search and is pretty easy to code. In optimizing you usually try such easy improvements first and only move on to harder optimizations if those don't give enough speed-up.
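A quick numeric illustration of the same-bin-but-far-apart point (a sketch using scipy, not ChimeraX code):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

rx = R.from_rotvec([np.pi / 2, 0, 0])   # 90 degrees about the x axis
rz = R.from_rotvec([0, 0, np.pi / 2])   # 90 degrees about the z axis
# Same rotation angle (same angle bin), but the relative rotation between them is large:
print(np.degrees((rz * rx.inv()).magnitude()))   # 120 degrees, far outside a 6-degree tolerance
```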
- no mol_center related calculations
- promoted d2max to be a class attribute
- 31-sec to 23-sec speedup for the case in #44
Hi Tom, Thanks for the guidance! The attached DiffFit_bins.zip includes my progress towards this issue. It has the following files.
In my initial post, there was a bug. Somewhere in the middle, I started to center all the molecules before fitting. So I tried to use numpy arrays to implement the clustering, and I also tried to optimize it in accordance with your suggested direction.

It gave a 23-sec to 19-sec speedup, but it produced 6473 clusters instead of 5121. I was not sure where the problem might be.

I did some searching for C++ integration in ChimeraX but found limited resources. From writing_bundles, the sample code link has expired, and the two GitHub repos don't have the C++ code or the guidelines. C++ Libraries seems to be unmaintained. DiffFit now depends on chimerax-qscore's _kmeans_cpp, but I could not compile that project. I also noticed the "Running C++ computations in a separate thread" recipe on ChimeraX Recipes, which I could not compile on my end either.

I also gave scikit-learn's Birch a try. The idea is to rotate a unit up vector and a unit right vector by the quaternion. Together with the shift, there will be 3 sets of coordinates to be clustered. This can be finished in 20 seconds, and these 3 sets can be clustered in parallel to gain more speedup (a rough sketch of the idea is below). However, using ChimeraX's clustering result as a standard, I find it hard to set the threshold values to get a similar number of unique clusters.
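A rough sketch of that Birch idea (assumptions: the quaternions are in scipy's (x, y, z, w) order as in the benchmark, and the thresholds below are placeholders I have not tuned):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R
from sklearn.cluster import Birch

rot = R.from_quat(mol_q)                       # mol_q: (N, 4) quaternions from the benchmark
up = rot.apply(np.array([0.0, 1.0, 0.0]))      # rotated unit up vectors, shape (N, 3)
right = rot.apply(np.array([1.0, 0.0, 0.0]))   # rotated unit right vectors, shape (N, 3)

# Cluster the shifts and the two rotated-vector sets separately (could run in parallel);
# two fits belong to the same final cluster only if all three labels agree.
labels = [Birch(threshold=3.0, n_clusters=None).fit_predict(x)
          for x in (mol_shift, up, right)]
```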
I then did some higher-level filtering to expose a user-controllable parameter to set a maximum number of fits per molecule to be clustered. The default is the top 10,000 fits ranked by occupied density. With this setting, the speed of ChimeraX's bin clustering is fine. Kindly let me know if you have any comments or suggestions. Thanks again.
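For reference, the filtering itself is just a top-k selection with numpy (a sketch; occupied_density is assumed to be an (N,) array of the per-fit scores):

```python
max_fits = 10000                                        # user-controllable upper bound
order = np.argsort(occupied_density)[::-1][:max_fits]   # indices of the best-scoring fits
fit_res_top = fit_res_filtered[order]                   # only these go into the bin clustering
```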
Your original binned fitting (with default bins 10x bigger than the cluster sizes) takes 32.3 seconds for your 250,000 fits on my Mac laptop. Making the bin size twice the cluster size (bfactor = 2) reduces the time to 12.7 seconds. Following my suggestion of first checking the center bin for a close transform (almost all of the 250,000 transforms have a binned close transform, and this catches that quickly) reduces the time to 6.67 seconds, almost a factor of 2 speed-up, and it gives the correct number of clusters, 5124. Code is attached; the gist of the change is also sketched below.

The dead ChimeraX sample code link and blank C++ Libraries page should be reported with ChimeraX menu Help / Report a Bug so we can fix those documentation problems. I already reported them, but it would be great if you report problems in the future to help make ChimeraX better. I don't recommend multiple C++ threads as it is extremely error prone, and the example on ChimeraX Recipes is from a non-UCSF developer Tristan and is many years old. Your improvement of binning only the best scoring fits seems ok.
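The sketch of that center-bin-first change (note this is only a sketch: transforms_in_bin() below is a hypothetical helper for looking up a single hash bin, and the actual attached code goes through the Bins class internals):

```python
class FastBinnedTransforms(Binned_Transforms):
    # Sketch of the "check the center bin first" optimization. Use bfactor = 1 so the
    # center bin is exactly one angle/translation tolerance wide.
    def has_close_transform(self, tf):
        c = self.bin_point(tf)
        for ctf in transforms_in_bin(self.bins, c):   # hypothetical single-bin lookup
            if self._is_close(tf, ctf):
                return True                           # fast path: hit in the center bin
        # Slow path: fall back to searching all neighboring bins.
        return len(self.close_transforms(tf)) > 0

    def _is_close(self, tf, ctf):
        # Same translation/angle test as the closest_transforms() code earlier in the thread.
        x, y, z = tf * self.center
        cx, cy, cz = ctf * self.center
        dx, dy, dz = x - cx, y - cy, z - cz
        if dx * dx + dy * dy + dz * dz > self.translation * self.translation:
            return False
        return (ctf * tf.inverse()).rotation_angle() < self.angle
```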
Thanks, Tom! I unfortunately couldn't reproduce the improvement with the attached code on Windows 11. I tried commenting Line 66 and uncommenting Line 67 in the attached script.
Sure. I will report such things using the Help / Report a Bug button.
On my way back home, I realized bfactor has to be 1. Sorry and thanks! |
Oops! My optimized version from 3 comments ago did not measure the translation right. That is why it didn't get the expected 5124 clusters. Also, I changed bfactor and commented out the optimization before I sent it to you by accident, because I was retiming the original unoptimized version. Here is the fixed optimization (mod3). I'm on the Mac Studio now, and the original bfactor = 10 time on this computer is 27.7 seconds, the bfactor = 2 time is 11.5 seconds, and the optimized version time is 6.7 seconds. All give 5124 clusters as expected.
- 23-sec to 16-sec speedup
- Still need to incorporate into the code base
- Still need to understand the inverse operation
Hi Tom, Thanks again! I made a few changes and applied the solution.
I might convert this later.

(Misc: The d2max variable in the ChimeraX code base could be refactored as a class attribute. My previous try didn't get it right because I didn't check the nearby bins when the central bin didn't have a close transform; I only checked whether the central bin existed in the bins...)
It is a surprising fact that for two rotations Q and R the rotation angle of QR equals the rotation angle of RQ. It would be nice to have an intuitive proof of that fact.

Worrying about where to put the d2max = d * d or where to put the tf.inverse() call is not a sensible optimization unless it makes a significant (e.g. > 2%) difference in speed, and I don't think it does in these cases. I suggest keeping the code as easy to read and understand as possible by avoiding such tiny optimizations.
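One way to see the QR/RQ fact above (a sketch of an argument, not from this thread): the rotation angle of a rotation matrix is determined by its trace, and the trace is invariant under cyclic permutation, so

$$\operatorname{tr}(M) = 1 + 2\cos\theta, \qquad \operatorname{tr}(QR) = \operatorname{tr}(RQ) \;\Longrightarrow\; \theta_{QR} = \theta_{RQ} \quad (\theta \in [0, \pi]).$$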
Hi Tom @tomgoddard,

DiffFit has been calling the same clustering algorithm as the fitmap command in ChimeraX. The algorithm relies on the Binned_Transforms class and its close_transforms and add_transform methods from chimerax.geometry.bins. This step is observed to be a rate-limiting step in quite a few cases, and I'm tempted to speed it up.

Before I knew the built-in chimerax.geometry.bins, I once implemented an algorithm to approximately solve it. It was buggy and generated more clusters than chimerax.geometry.bins. Often, two clusters were actually within the threshold and thus should have been just one cluster. (I unfortunately could not figure out why...)

I wonder if you have any idea how to speed up the clustering. Ideally, the required computation should be vectorized so that it can be done on PyTorch. Or at least, it should be possible to parallelize it through, for example, OpenMP.
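For concreteness, a hedged sketch of what one vectorized step could look like in NumPy -- comparing a single new fit against all current cluster representatives at once (array names and tolerances are placeholders, and it assumes the molecules are centered so the transform's shift is the moved center point):

```python
import numpy as np

def close_to_any(shift, quat, kept_shifts, kept_quats, shift_tol=3.0, angle_tol_deg=6.0):
    # shift: (3,), quat: (4,) unit quaternion; kept_*: arrays over current cluster representatives
    d2 = np.sum((kept_shifts - shift) ** 2, axis=1)
    # |dot(q_i, q)| = cos(theta_i / 2), where theta_i is the relative rotation angle
    dots = np.clip(np.abs(kept_quats @ quat), 0.0, 1.0)
    angles = 2.0 * np.degrees(np.arccos(dots))
    return np.any((d2 <= shift_tol ** 2) & (angles <= angle_tol_deg))
```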
Many thanks!