Filter out infinities in radix-based select-k #1742
Conversation
I think there is a corner case. What if the first …
This will work fine. The trick here is that if there are no bound values in the first …
Got it. Really smart strategy.
Should the k-th values always be written from the end?
The problem is that with this implementation …
Thanks for the explanation. The code looks good to me. I suggest adding unit tests which contain infinities. However, I'm a little concerned about whether we should add such special treatment. I'll add comments in #1725, which has the concrete context.
Thanks for reminding me to write the tests! Indeed there was a bug :) I forgot to filter out infinities in the last-filter (broken in the case where it takes the original input data as `in_buf`).
Found a bug. The test case is: … Then the results are: … The last value in …
Indeed, apparently the index isn't passed in the zero-th pass of the one-block kernel. Thanks! I'll fix and add your test case tomorrow. This shouldn't affect performance in any way, so I'll skip re-running all benchmarks.
… fix the non-set in_idx_buf in the zero-th pass of the one-block kernel
Thank you @achirkin for implementing this workaround and for the detailed benchmarks presented here! Also thanks to @yong-wang for the constructive discussion and for further benchmarks. The conversation here and in issue #1725 was really detailed and illuminating. After going through the discussion, I have the following picture (please correct me if I am wrong):
I tend to agree with Yong that it would be preferable not to further complicate the radix-k selection code if this only treats a corner case. So the question of whether we should integrate this PR into RAFT depends on whether the corner case needs to be addressed or not. As Yong has pointed out, having so many infs during ANN search means …
According to Artem, …
These are all important points. While it is true that, on average, there is practically no perf change for the non-corner cases, we are averaging the results of arbitrarily defined gbench benchmarks. I am a bit concerned about the +/- 10% effect on these benchmarks: are we sure that we average a relevant subset? In Yong's benchmark plot we see a small but noticeable perf degradation. Instead of averaging the gbench benchmarks, I believe we should take a set of relevant ANN benchmarks and see how this PR affects the perf there (or alternatively define gbench tests that correspond to these). Because of these concerns, for the regular ANN search case, I would be happier to get a warning message like "k-th value is inf, please check your precision/normalization/filtering", instead of modifying the k-selection kernels. I am not so sure about the pre-filtering; this could still motivate the solution presented in this PR. @cjnolet do we expect to filter so many values that fewer than k non-inf values remain in the end? Do we expect this to occur so often in practice that we should add a special case for radix topk? If yes, then I would be in favor of merging this.
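For context, here is a minimal caller-side sketch of the warning alternative suggested above. It is not an existing RAFT API; the function name `warn_if_kth_is_inf` is purely illustrative, and it assumes the selected distances come back sorted in ascending order:

```cpp
// Hypothetical caller-side check (not an existing RAFT API): warn the user when
// the k-th selected distance is still infinite, instead of special-casing the
// selection kernels.
#include <cmath>
#include <cstdio>

template <typename T>
void warn_if_kth_is_inf(const T* selected_dists, int k)
{
  // Assumes the k smallest distances are stored in ascending order,
  // so the last entry is the k-th value.
  if (k > 0 && std::isinf(selected_dists[k - 1])) {
    std::fprintf(stderr,
                 "Warning: k-th value is inf, please check your "
                 "precision/normalization/filtering.\n");
  }
}
```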
Add a few extra test and benchmark cases; in particular:
1. Allow specifying non-trivial input indices
2. Allow filling the input data with infinities to see how algorithms perform in edge cases

These tests are borrowed from the controversial workaround #1742

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: #1821
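For illustration, here is a rough sketch of what "filling the input data with infinities" could look like in such a test. This is a hypothetical helper, not the code actually merged in #1821, and the `inf_fraction` parameter is an assumption:

```cpp
// Hypothetical test helper (not the code merged in #1821): replace a chosen
// fraction of the candidate values with +infinity before running select_k,
// so the edge case with many repeated bound values is exercised.
#include <limits>
#include <random>
#include <vector>

template <typename T>
void fill_with_infinities(std::vector<T>& data, double inf_fraction, unsigned seed = 42)
{
  std::mt19937 rng(seed);
  std::bernoulli_distribution make_inf(inf_fraction);
  for (auto& v : data) {
    if (make_inf(rng)) { v = std::numeric_limits<T>::infinity(); }
  }
}
```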
For deletion, we haven't gotten a whole lot of consensus on patterns encountered in practice, but we have been told that it's possible for the number of valid values in a query to be less than k for some data points. We really need to be able to support the generalized cases very efficiently, so assume that not everyone is going to be returning a list with <5 materialized values, but folks that need that capability would prefer not to take a perf hit, especially since there's already going to be a perf hit in the filtering functions themselves. Aside from deletion, consider other (upcoming) use cases, like filtering recommendations for items that a user has already purchased, or multi-valued keys where a document might only be returned once even if multiple tokens for the same document end up in the list of nearest neighbors. We want these cases to be fast, but we probably don't want to do it at the expense of the non-filtered case, since I still believe that's going to be the most widely used.
As a means of filtering, ANN methods can produce a lot of repeated `max_bound<T>`/`min_bound<T>` values. These are fed to a `select_k` function, which leads to poor performance if the radix-based implementation is used. This is due to the nature of the algorithm (lots of values with the same bit representation).

This fix filters out `max_bound<T>`/`min_bound<T>` values as a special case. It works as follows:

1. During the zero-th pass, check the first `k` values of the input for being `max_bound<T>`/`min_bound<T>` and add them to the end of the output if found.
2. Values equal to `max_bound<T>`/`min_bound<T>` are explicitly ignored; this breaks the assumption that the inputs always have enough values; the PR makes the code not rely on this assumption by slightly modifying comparisons.
3. The last filtering of the `k`-th values (`bits == kth_value_bits`) is changed to fill the output from `k - needed_num_of_kth` in order to not override the `max_bound<T>`/`min_bound<T>` values written during the zero-th pass.

Closes: #1725
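To make the zero-th-pass idea above more concrete, here is a minimal host-side sketch. It is not RAFT's actual CUDA kernel code; it assumes an ascending select where the bound sentinel is +infinity, and the helper name is purely illustrative:

```cpp
// Host-side illustration (not RAFT's CUDA kernel): scan the first k inputs for
// the bound sentinel and stash any matches at the tail of the output, so the
// later radix passes can write the real top-k values from the front without
// overwriting them.
#include <cstddef>
#include <limits>
#include <vector>

template <typename T>
void stash_bound_values_from_head(const std::vector<T>& in,
                                  std::size_t k,
                                  std::vector<T>& out,     // size k, pre-allocated
                                  std::size_t& n_stashed)  // how many sentinels were stashed
{
  const T bound = std::numeric_limits<T>::infinity();  // stands in for max_bound<T>
  n_stashed     = 0;
  for (std::size_t i = 0; i < k && i < in.size(); ++i) {
    if (in[i] == bound) {
      out[k - 1 - n_stashed] = bound;
      ++n_stashed;
    }
  }
}
```

Writing the stashed sentinels from the end is one way to read why step 3 starts filling at `k - needed_num_of_kth`: the tail slots are already occupied by the bound values, so the front of the output stays free for the finite results.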