
Runtime Error in cdist #601

Closed
Inzlinger opened this issue Jun 22, 2020 · 6 comments

Assignees
Labels
bug Something isn't working

Comments

@Inzlinger
Collaborator

Description
When running cdist on distributed tensors, a runtime error occurs.

Traceback (most recent call last):
File "demo_knn.py", line 131, in
print(verify_algorithm(X, Y, 5, 30, 5))
File "demo_knn.py", line 127, in verify_algorithm
result_y = classifier.predict(verification_x)
File "/home/jakob/neural/heat/heat/classification/knn.py", line 44, in predict
distances = ht.spatial.cdist(X, self.x)
File "/home/jakob/neural/heat/heat/spatial/distance.py", line 128, in cdist
return _dist(X, Y, _euclidian)
File "/home/jakob/neural/heat/heat/spatial/distance.py", line 383, in _dist
d._DNDarray__array[:, cols[0] : cols[1]] = d_ij
RuntimeError: The expanded size of the tensor (8) must match the existing size (11) at non-singleton dimension 1. Target sizes: [30, 8]. Tensor sizes: [27, 11]

To Reproduce
On branch https://github.com/helmholtz-analytics/heat/tree/features/556-assign_label
run "mpirun -n 4 demo_knn.py " (in folder heat/examples/classification)

Version Info
On branch https://github.com/helmholtz-analytics/heat/tree/features/556-assign_label

@Inzlinger Inzlinger added the bug Something isn't working label Jun 22, 2020
@Cdebus
Contributor

Cdebus commented Jun 25, 2020

Uh, interesting. I will have a look at it, thanks for raising this.

@ClaudiaComito ClaudiaComito modified the milestones: 2-week sprint, v0.5.0 Jun 30, 2020
@ClaudiaComito ClaudiaComito removed this from the v0.5.0 milestone Jul 2, 2020
@Cdebus
Contributor

Cdebus commented Jul 14, 2020

Ok, had a look into it... It is arguable whether this is a bug.
The reason this breaks is that your create_fold function does not yield balanced DNDarrays. However, for the calculation of the tiles, _dist assumes a balanced array (the tile locations/sizes are calculated via comm.counts_displs_shape).
Running .balance() on the fold and verification arrays solves the problem (this needs to be done after slicing indices in knn.py : predict l. 65 as well; by the way, there are some more problems with this example).

However, we should discuss whether the distance functions should allow for unbalanced arrays. This would require additional communication overhead though. @Markus-Goetz what do you say?
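For illustration only (not part of the original comment), a minimal sketch of the workaround described above, using heat's ht.random.randn, DNDarray.balance_ and ht.spatial.cdist; the shapes and slice points are made up:

    import heat as ht

    # A distributed array, split along the rows.
    X = ht.random.randn(120, 3, split=0)

    # A global slice can leave the rows unevenly distributed across processes
    # (an unbalanced DNDarray), which is roughly what create_fold produces.
    fold = X[0:90]
    verification = X[90:120]

    # Rebalance in place so every process holds roughly the same number of rows;
    # _dist derives its tile sizes from counts/displacements and assumes this.
    fold.balance_()
    verification.balance_()

    # With balanced inputs the tile sizes match and cdist no longer raises.
    distances = ht.spatial.cdist(verification, fold)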

@Markus-Goetz
Member

I think calling balance is the way to go here, because we also want to ensure that computations are more or less equally distributed.

The create_fold function could be improved here. Instead of taking a global slice, which leaves the last processes empty-handed, we could simply do a local slice and stitch the results back together into a global DNDarray. This way we also avoid heavy communication.
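A sketch of that local-slice idea (assumed code, not taken from the branch): each process slices its own chunk and the pieces are stitched back together with ht.array(..., is_split=0). The local tensor is accessed via the private _DNDarray__array attribute seen in the traceback, and the every-5th-row split is purely illustrative:

    import torch
    import heat as ht

    X = ht.random.randn(120, 3, split=0)

    # Slice the process-local torch tensor instead of the global array.
    local = X._DNDarray__array
    keep = torch.ones(local.shape[0], dtype=torch.bool)
    keep[::5] = False  # every 5th local row goes to the verification set

    fold_local = local[keep]
    verification_local = local[~keep]

    # Stitch the local pieces back into global DNDarrays along the split axis.
    fold = ht.array(fold_local, is_split=0)
    verification = ht.array(verification_local, is_split=0)

    # The result is split correctly but not necessarily balanced, so rebalance
    # before handing it to cdist (see the follow-up comments below).
    fold.balance_()
    verification.balance_()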

@Cdebus
Contributor

Cdebus commented Jul 14, 2020

That is what @Inzlinger is actually doing (calling ht.array(torch.tensor, is_split=0)). However, this does not result in balanced DNDarrays (something I also discovered when doing the kmeans test enhancement).
So maybe that is the point to look at.
I agree, though, that we should leave cdist as is, since the official heat policy is "balanced in, balanced out", as far as I recall.

@Markus-Goetz
Member

Sorry, my bad. Did not look into it deeply enough. In this case calling balance_() should be sufficient and actually not involve too much communication.

Correct about the balanced in, balanced out part.

@Cdebus
Contributor

Cdebus commented Jul 14, 2020

Brilliant, then I can close this.
@Inzlinger I added the following to your create_fold function (l. 104) and pushed the changes:

    fold_x.balance_()
    fold_y.balance_()
    verification_y.balance_()
    verification_x.balance_()
    return fold_x, fold_y, verification_x, verification_y

Also, your predict function in knn.py has a balancing problem: l. 65 causes it, because the slicing leaves some processes without any data. But I didn't look further into that.
