[BUG] fused_l2_nn_argmin produces unexpected results #1036

benfred · 2022-11-18T17:51:16Z

Describe the bug
Running fused_l2_nn_argmin sometimes produces results that appear incorrect.

As a trivial test case, passing the same matrix for both inputs - sometimes returns results where each row isn't
it's own nearest neighbour.

Steps/Code to reproduce bug

import cupy as cp
from numpy.testing import assert_array_equal
from pylibraft.distance import fused_l2_nn_argmin


X = cp.array([[0.15804534, 0.70286853, 0.1001605, 0.30935687, 0.44620642],
 [0.40115238, 0.36828412, 0.58650847, 0.26410076, 0.9525248 ],
 [0.11997995, 0.85193706, 0.27353832, 0.13150834, 0.98335868],
 [0.21238154, 0.34168486, 0.94976721, 0.33517375, 0.41023301]])

ids = fused_l2_nn_argmin(X, X, sqrt=True).copy_to_host()

# ids are [0, 2, 2, 3] on my system - but I would expect each row to be its own nearest neighbour [0, 1, 2, 3] 
assert_array_equal(ids, np.arange(4))

Environment details (please complete the following information):

Environment location: Bare-metal
Method of RAFT install: source
Cuda 11.4 / A6000 GPU

The text was updated successfully, but these errors were encountered:

cjnolet · 2022-11-18T19:33:08Z

cc @Nyrio have you encountered this behavior in your dealings with the fused l2 primitive?

cjnolet · 2022-11-18T21:17:44Z

@benfred it looks like it might be related to the sqrt here:

>>> X = cp.random.random((100, 5)).astype(cp.float32)
>>> ids = fused_l2_nn_argmin(X, X, sqrt=False).copy_to_host()
>>> ids
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
      dtype=int32)
>>> ids = fused_l2_nn_argmin(X, X, sqrt=True).copy_to_host()
>>> ids
array([ 0, 74,  2,  3,  4,  5,  6, 79,  8,  9, 10, 11, 12, 13, 14, 63, 16,
       17, 18, 19, 54, 95, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 34, 36, 37, 48, 39, 40, 41, 56, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 26, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73,  1, 75, 76, 57, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 22, 92, 93, 94, 95, 96, 97, 98, 99],
      dtype=int32)
>>>

Nyrio · 2022-11-21T10:10:52Z

@benfred Thanks for reporting this bug!
@cjnolet I haven't noticed that before, but can have a closer look.

… x and y (#1040) Solves #1036 Even when computing a sum of squares, the distance from a point to itself can apparently be `-0.0` in which case the square root is `nan` and comparisons are broken. Authors: - Louis Sugy (https://github.com/Nyrio) Approvers: - Ben Frederickson (https://github.com/benfred) - Corey J. Nolet (https://github.com/cjnolet) URL: #1040

Nyrio · 2023-10-17T09:38:58Z

@cjnolet What's the reason for reopening?

cjnolet · 2023-10-17T14:25:57Z

@Nyrio we noticed yesterday that our Google tests haven't been running in CI since July because a change was made to the CI script causing them the runner to fail silently.

Unfortunately, there's now some failures in the fused l2 nn tests and the ivf pq tests. We think the fused l2 nn is likely causing the ivf pq errors (through the balanced kmeans).

I'm not sure yet if it's the round off errors from this issue that are the culprit. Our python fused p2 nn tests are passing but the Google tests are not and it looks like it's more issues where a neigbor's distance is supposed to be 0 but it's nonzero instead.

What I found last night is that ether approximation in the polynomial expansion can cause this.

For example, let's say we have a self looped distance, which should be 0. Let's say the norm for X and Y vectors are 2.54321. Those will match because we compute the norms using the same function. The dot product, however, could be 2.54322 because of round off errors. What I notice happening is that the difference between the norms and the dot product will leave something like 0.00001 instead of true 0 and once we take the sqrt of that, we get something like 0.00316 instead of 0.

Nyrio · 2023-10-17T14:28:42Z

Ok, thanks for explaining.

cjnolet · 2023-10-18T20:45:40Z

This should be fixed once and for all in this PR: #1904. I think the changes in that PR also beg for potential future optimizations. That round-off error is not easy to deal with.

benfred added the bug Something isn't working label Nov 18, 2022

benfred mentioned this issue Nov 18, 2022

Expose cluster_cost to python #1028

Merged

cjnolet added this to VS/ML/DM Primitives Release Board Nov 18, 2022

Nyrio mentioned this issue Nov 22, 2022

[HOTFIX] Fix fusedL2NN bug that can happen when the same point appears in both x and y #1040

Merged

benfred linked a pull request Nov 22, 2022 that will close this issue

[HOTFIX] Fix fusedL2NN bug that can happen when the same point appears in both x and y #1040

Merged

rapids-bot bot closed this as completed in #1040 Nov 23, 2022

rapids-bot bot moved this to Done in VS/ML/DM Primitives Release Board Nov 23, 2022

cjnolet reopened this Oct 17, 2023

cjnolet closed this as completed Oct 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] fused_l2_nn_argmin produces unexpected results #1036

[BUG] fused_l2_nn_argmin produces unexpected results #1036

benfred commented Nov 18, 2022 •

edited

Loading

cjnolet commented Nov 18, 2022

cjnolet commented Nov 18, 2022

Nyrio commented Nov 21, 2022

Nyrio commented Oct 17, 2023

cjnolet commented Oct 17, 2023

Nyrio commented Oct 17, 2023

cjnolet commented Oct 18, 2023

[BUG] fused_l2_nn_argmin produces unexpected results #1036

[BUG] fused_l2_nn_argmin produces unexpected results #1036

Comments

benfred commented Nov 18, 2022 • edited Loading

cjnolet commented Nov 18, 2022

cjnolet commented Nov 18, 2022

Nyrio commented Nov 21, 2022

Nyrio commented Oct 17, 2023

cjnolet commented Oct 17, 2023

Nyrio commented Oct 17, 2023

cjnolet commented Oct 18, 2023

benfred commented Nov 18, 2022 •

edited

Loading