Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] IVF flat incorrect distance values with int8 data #1058

Closed
tfeher opened this issue Nov 30, 2022 · 1 comment
Closed

[BUG] IVF flat incorrect distance values with int8 data #1058

tfeher opened this issue Nov 30, 2022 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@tfeher
Copy link
Contributor

tfeher commented Nov 30, 2022

Describe the bug
The distance values returned with IVF flat search method might be incorrect in certain cases

IVF flat search is used by ANN refinement (if the dataset is in device memory) #1052. The Python tests of refinement revealed this problem. This only happens with int8 data (and not with uint8). IVF flat search has specific kernels for different data types, this might point towards a bug in the int8 specialization.

Note: the returned neighbor indices are correct, only the distance values show this problem.

Steps/Code to reproduce bug
To reproduce the issue:

  1. Uncomment the following lines
    https://github.com/tfeher/raft/blob/c34952ba42ca313e280f89fe4a0bc6a495a03852/python/pylibraft/pylibraft/test/test_refine.py#L127-L128

  2. Run the refinement test

pytest pylibraft/test/test_refine.py -k test_refine_dtypes

Output:

>       assert np.mean(diff) < eps
E       assert 0.41265658 < 0.001
E        +  where 0.41265658 = <function mean at 0x14cd30085e50>(array([[0.46209675, 0.3915514 , 0.41272217, 0.41572857, 0.40346152,\n        0.37558016, 0.39662588, 0.36034888, 0.3658...373, 0.33194792, 0.20105657,\n        0.38698354, 0.3945595 , 0.40543446, 0.3377944 , 0.3658476 ]],\n      dtype=float32))
E        +    where <function mean at 0x14cd30085e50> = np.mean

pylibraft/test/test_ivf_pq.py:79: AssertionError
========================================================================================================== short test summary info ==========================================================================================================
FAILED pylibraft/test/test_refine.py::test_refine_dtypes[device-int8-l2_expanded-True-100] - assert 0.41855982 < 0.001
FAILED pylibraft/test/test_refine.py::test_refine_dtypes[device-int8-l2_expanded-True-1024] - assert 0.41032004 < 0.001
FAILED pylibraft/test/test_refine.py::test_refine_dtypes[device-int8-l2_expanded-True-37] - assert 0.40580684 < 0.001
FAILED pylibraft/test/test_refine.py::test_refine_dtypes[device-int8-l2_expanded-False-100] - assert 0.41458178 < 0.001
FAILED pylibraft/test/test_refine.py::test_refine_dtypes[device-int8-l2_expanded-False-1024] - assert 0.41299787 < 0.001
FAILED pylibraft/test/test_refine.py::test_refine_dtypes[device-int8-l2_expanded-False-37] - assert 0.41265658 < 0.001
========================================================================================== 6 failed, 42 passed, 24 skipped, 20 deselected in 7.37s ==========================================================================================

Expected behavior
Passing tests with int8 data.

Additional context
The current C++ unit test only inspect the distance values if the neighbor indices are different.

auto operator==(const idx_dist_pair<IdxT, DistT, CompareDist>& a) const -> bool
{
if (idx == a.idx) return true;
if (eq_compare(dist, a.dist)) return true;
return false;

The unit test shall be extended with a test of the actual distance values. The actual distance between the neighbor and the query vector shall be inspected, and compared to a naive distance calculation. Such test is done in the IVF-PQ python tests and in the refinment test. But such test is not yet applied for the IVF-flat results.

Note, even IVF PQ had an issue with the distance values, but that was fixed in #892

@tfeher tfeher added the bug Something isn't working label Nov 30, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 10, 2023
Solves #1058 

This is a tricky bug so the fix deserves some explanation. The previous implementation of `euclidean_dist` was the following in vectorized cases, where `x` and `y` are `int32` vectors of 4 `int8` each and `acc` is a single `int32` number to accumulate the distance in:

```c++
// Compute vectorized absolute differences independently.
const auto diff = static_cast<int32_t>(__vabsdiffs4(x, y));
// Square, reduce, and add to the accumulator.
acc = dp4a(diff, diff, acc);
```

Now consider the following case:

```c++
x = 0x80; // -128, 0, 0, 0
y = 0x7f; //  127, 0, 0, 0
```

The difference between -128 and 127 is 255, represented as `FF` (`__vabsdiffs4` is smart enough not to compute `abs(a-b)` which would result in `01`). However, if we call the signed version of `dp4a`, `FF` is cast from `int8` to `int32` as `FFFFFFFF` (or -1). The square of -1 is 1, which is added to `acc` (instead of 65025).

As the output of `__vabsdiffs4` is correct when considered as an unsigned number, and as addition is the same for signed and unsigned in 2's complement (and `acc` is positive anyway), the easiest fix is to use the unsigned version of `dp4a`, which will cast overflowed differences properly to 32 bits. The previous code simply becomes:

```c++
const auto diff = __vabsdiffs4(x, y);
acc = dp4a(diff, diff, static_cast<uint32_t>(acc));
```

-----

Additionally, to avoid underflows in the non-vectorized unsigned case, I replaced the subtraction with `__usad` (absolute difference of unsigned numbers). Note that using the subtraction was correct anyway, because the addition/subtraction is the same for unsigned and signed integers, as well as the least significant half of the multiplication (which is the part that is stored), and the square of a number is also the square of its opposite. Consider:

```c++
uint32_t a = 10;
uint32_t b = 20;
uint32_t c = a - b; // fffffff6, i.e -10 or 4294967286
uint32_t d = c * c; // (ffffffec)00000064, i.e 100
```

Authors:
  - Louis Sugy (https://github.com/Nyrio)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Artem M. Chirkin (https://github.com/achirkin)

URL: #1122
@Nyrio
Copy link
Contributor

Nyrio commented Jan 18, 2023

The fix has been merged.

@Nyrio Nyrio closed this as completed Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants