clustering.py #5
Comments
I tried your examples and found that your method performs better. However, the way the center is updated each time can be changed to `max_centers[labels[i]] = onp.where(max_centers[labels[i]] > X[i], max_centers[labels[i]], X[i])`, which keeps the maximum coordinate of each dimension of each cluster.
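A minimal sketch of how that per-sample update could be vectorized over a whole pass, assuming (as in the snippet above) `X` is an `(n_samples, n_dims)` array, `labels` an `(n_samples,)` integer array, and `max_centers` a `(k, n_dims)` array:

```python
import numpy as onp

# Unbuffered in-place maximum: for every i this effectively performs
# max_centers[labels[i]] = onp.maximum(max_centers[labels[i]], X[i]),
# applied for all samples at once (duplicate labels are handled correctly).
onp.maximum.at(max_centers, labels, X)
```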
Hello @WangKehan573, yes, I will update the code to vectorize the center calculation. However, I am not sure why that should eliminate the need for shuffling. Currently, the clustering algorithm depends on the order of the geometries because the centers are updated after each label update, unlike the original k-means where the centers are updated only after reassigning the labels of all geometries. So shuffling might not help performance much, but I added it to the code to improve robustness. I will perform new tests to investigate the effects of shuffling.
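To make the order dependence concrete, here is a hedged sketch of such an online pass (`assign_label` is a hypothetical helper standing in for the nearest-center lookup; the point is that any permutation of the loop order can change the final centers):

```python
import numpy as onp

def online_pass(X, labels, max_centers, rng):
    # Shuffling each iteration randomizes the order dependence instead of
    # inheriting whatever order the input geometries happen to arrive in.
    for i in rng.permutation(len(X)):
        labels[i] = assign_label(X[i], max_centers)  # hypothetical nearest-center helper
        # Center is updated immediately, so later samples see a different state.
        max_centers[labels[i]] = onp.maximum(max_centers[labels[i]], X[i])
    return labels, max_centers
```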
Great, thanks for your answer. The code I modified is as follows (do not update the centers in `assign_labels()`; do it in `calc_cost()`!):

```python
@numba.njit
def modified_kmeans(systems, k=3, max_iterations=100, rep_count=10, print_mode=True):
    ...  # body as posted in the original comment
```
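A hedged sketch of the separation being described, assuming the cost of a cluster is its per-dimension maxima summed and weighted by its size (this is my reading of the objective discussed below, not the posted implementation; features are assumed non-negative counts):

```python
import numpy as onp

def assign_labels(X, centers):
    # Only assigns: Manhattan distance from every point to every center.
    dists = onp.abs(X[:, None, :] - centers[None, :, :]).sum(axis=-1)
    return dists.argmin(axis=1)

def calc_cost(X, labels, k):
    # Centers are recomputed here, once per iteration, as in batch k-means.
    centers = onp.zeros((k, X.shape[1]))
    onp.maximum.at(centers, labels, X)           # per-cluster, per-dimension maxima
    counts = onp.bincount(labels, minlength=k)   # cluster sizes
    cost = (centers.sum(axis=1) * counts).sum()  # sum_c |C_c| * sum_d max_d
    return cost, centers
```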
I noticed the following in the implementation of the clustering algorithm, lines 82-84:

```python
for i in range(len(counts)):
    if counts[i] == 0:
        centroids[i,:] = 0.0
```

This shows that the number of centers selected initially may be greater than the number of non-empty clusters. The root cause is that the "distance" defined in the original paper does not meet the mathematical definition of a metric: a metric requires the distance from a point to itself to be 0, and that is not reflected in the code.
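An illustrative example of this failure mode, assuming (my reading, not necessarily the paper's exact definition) that the "distance" from a geometry to a center is the cost of the merged cluster, e.g. the sum of the element-wise maxima:

```python
import numpy as onp

g = onp.array([10.0, 50.0, 200.0])  # hypothetical (n1, n2, n3) interaction counts
d = onp.maximum(g, g).sum()         # "distance" from g to itself under this reading
print(d)                            # 260.0, not 0 -- so two identical geometries
                                    # can still seed two distinct initial centers
```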
In addition, the order of the geometries affects the generation of the new center points, and thereby the subsequent iterations of clustering. The approach adopted in the original text is to shuffle the input geometries after each iteration for randomization.
The purpose of this clustering is the pre-calculation of interaction lists that remain static throughout optimization (as described in the previous subsection), the grouping of input geometries with similar computational demands (explained below), and the alignment of the interaction lists of geometries in the same cluster (by padding as necessary) for efficient memory access. In essence, the task is to find a partition of a set that minimizes the sum, over all subsets, of the per-dimension maxima multiplied by the number of elements in the subset.
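Written out (my formalization of the sentence above), the objective for a partition $\{C_1, \dots, C_k\}$ of geometries with feature vectors $x \in \mathbb{R}^5$ would be:

$$\min_{\{C_1, \dots, C_k\}} \; \sum_{c=1}^{k} |C_c| \sum_{d=1}^{5} \max_{x \in C_c} x_d$$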
The approach I tried is to convert (n1, n2, n3, n4, n5) to (n2, n3, n4, n5, n1, n1), where n1, n2, n3, n4, and n5 are the numbers of atoms, 2-body interactions, 3-body interactions, 4-body interactions, and periodic boxes, respectively.
Then I directly perform k-medoids++ clustering under the Manhattan distance and, over multiple runs, select the grouping with the smallest computational cost. When the sample size is large, this generally takes less time to cluster and gives a more robust result with a smaller cost. And, as a traditional machine-learning method, it satisfies the mathematical definition of a metric, avoiding both the appearance of extra center points and the situation you want to avoid in lines 82-84.
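A minimal sketch of that pipeline, assuming scikit-learn-extra's `KMedoids` is available (the feature matrix, cost definition, and repetition scheme follow my reading of the comments above, not a posted implementation):

```python
import numpy as onp
from sklearn_extra.cluster import KMedoids  # assumption: scikit-learn-extra installed

def cluster_geometries(features, k=3, rep_count=10, seed=0):
    """k-medoids++ with Manhattan distance; keep the cheapest of rep_count runs."""
    best_cost, best_labels = onp.inf, None
    for r in range(rep_count):
        km = KMedoids(n_clusters=k, metric="manhattan",
                      init="k-medoids++", random_state=seed + r).fit(features)
        # Cost per the objective above: sum over clusters of |C| * sum of per-dim maxima.
        cost = sum(features[km.labels_ == c].max(axis=0).sum() * (km.labels_ == c).sum()
                   for c in range(k))
        if cost < best_cost:
            best_cost, best_labels = cost, km.labels_
    return best_labels, best_cost
```

Because medoids are always actual data points, this also sidesteps the empty-cluster case that lines 82-84 patch over.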