
Runtime Error in cdist #601

Closed
Inzlinger opened this issue Jun 22, 2020 · 6 comments

Assignees
Labels
bug Something isn't working

Comments

@Inzlinger
Collaborator

Description
When running cdist on distributed tensors, a runtime error occurs.

Traceback (most recent call last):
File "demo_knn.py", line 131, in
print(verify_algorithm(X, Y, 5, 30, 5))
File "demo_knn.py", line 127, in verify_algorithm
result_y = classifier.predict(verification_x)
File "/home/jakob/neural/heat/heat/classification/knn.py", line 44, in predict
distances = ht.spatial.cdist(X, self.x)
File "/home/jakob/neural/heat/heat/spatial/distance.py", line 128, in cdist
return _dist(X, Y, _euclidian)
File "/home/jakob/neural/heat/heat/spatial/distance.py", line 383, in _dist
d._DNDarray__array[:, cols[0] : cols[1]] = d_ij
RuntimeError: The expanded size of the tensor (8) must match the existing size (11) at non-singleton dimension 1. Target sizes: [30, 8]. Tensor sizes: [27, 11]

To Reproduce
On branch https://github.com/helmholtz-analytics/heat/tree/features/556-assign_label
run "mpirun -n 4 demo_knn.py " (in folder heat/examples/classification)

Version Info
On branch https://github.com/helmholtz-analytics/heat/tree/features/556-assign_label

@Inzlinger Inzlinger added the bug Something isn't working label Jun 22, 2020
@Cdebus
Contributor

Cdebus commented Jun 25, 2020

Uh, interesting. I will have a look at it, thanks for raising this.

@ClaudiaComito ClaudiaComito modified the milestones: 2-week sprint, v0.5.0 Jun 30, 2020
@ClaudiaComito ClaudiaComito removed this from the v0.5.0 milestone Jul 2, 2020
@Cdebus
Contributor

Cdebus commented Jul 14, 2020

Ok, had a look into it... It is arguable whether this is a bug.
The reason this breaks is that your create_fold function does not yield balanced DNDarrays. However, for the calculation of the tiles, _dist assumes a balanced array (the tile locations/sizes are calculated via comm.counts_displs_shape).
Running .balance() on the fold and verification arrays solves the problem (this needs to be done after slicing indices in knn.py : predict l. 65 as well; by the way, there are some more problems with this example).

However, we should discuss whether the distance functions should allow for unbalanced arrays. This would require additional communication overhead though. @Markus-Goetz what do you say?
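For illustration only (not part of the original comment), a minimal sketch of the workaround described above, using heat's ht.random.randn, DNDarray.balance_ and ht.spatial.cdist; the shapes and slice points are made up:

    import heat as ht

    # A distributed array, split along the rows.
    X = ht.random.randn(120, 3, split=0)

    # A global slice can leave the rows unevenly distributed across processes
    # (an unbalanced DNDarray), which is roughly what create_fold produces.
    fold = X[0:90]
    verification = X[90:120]

    # Rebalance in place so every process holds roughly the same number of rows;
    # _dist derives its tile sizes from counts/displacements and assumes this.
    fold.balance_()
    verification.balance_()

    # With balanced inputs the tile sizes match and cdist no longer raises.
    distances = ht.spatial.cdist(verification, fold)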

@Markus-Goetz
Member

I think calling balance is the way to go here, because we also want to ensure that computations are more or less equally distributed.

The create_fold function could be improved here. Instead of taking a global slice, which leaves the last processes empty-handed, we could simply do a local slice and stitch the results back together into a global DNDarray. This way we also avoid heavy communication.
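A sketch of that local-slice idea (assumed code, not taken from the branch): each process slices its own chunk and the pieces are stitched back together with ht.array(..., is_split=0). The local tensor is accessed via the private _DNDarray__array attribute seen in the traceback, and the every-5th-row split is purely illustrative:

    import torch
    import heat as ht

    X = ht.random.randn(120, 3, split=0)

    # Slice the process-local torch tensor instead of the global array.
    local = X._DNDarray__array
    keep = torch.ones(local.shape[0], dtype=torch.bool)
    keep[::5] = False  # every 5th local row goes to the verification set

    fold_local = local[keep]
    verification_local = local[~keep]

    # Stitch the local pieces back into global DNDarrays along the split axis.
    fold = ht.array(fold_local, is_split=0)
    verification = ht.array(verification_local, is_split=0)

    # The result is split correctly but not necessarily balanced, so rebalance
    # before handing it to cdist (see the follow-up comments below).
    fold.balance_()
    verification.balance_()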

@Cdebus
Contributor

Cdebus commented Jul 14, 2020

That is what @Inzlinger is actually doing (calling ht.array(torch.tensor, is_split=0)). However, this does not result in balanced DNDarrays (something I also discovered when doing the kmeans test enhancement).
So maybe that is the point to look at.
I agree, though, that we should leave cdist as is, since the official heat policy is "balanced in, balanced out", as far as I recall.

@Markus-Goetz
Member

Sorry, my bad. Did not look into it deeply enough. In this case calling balance_() should be sufficient and actually not involve too much communication.

Correct about the balanced in, balanced out part.

@Cdebus
Contributor

Cdebus commented Jul 14, 2020

Brilliant, then I can close this.
@Inzlinger I added the following to your create_fold function (l. 104) and pushed the changes:

    fold_x.balance_()
    fold_y.balance_()
    verification_y.balance_()
    verification_x.balance_()
    return fold_x, fold_y, verification_x, verification_y

Also, your predict function in knn.py has a balancing problem: l. 65 causes it, because the slicing leaves some processes without any data. But I didn't look further into that.
