
Is there a way to specify different algo args based on the dataset? #178

Closed

alexklibisz opened this issue Sep 9, 2020 · 5 comments

@alexklibisz
Contributor

alexklibisz commented Sep 9, 2020

I've been working on integrating elastiknn: https://github.com/alexklibisz/elastiknn
Getting close to making a PR, but I have a case where the good args for one dataset are different from the good args for another, both with euclidean distance.
I guess I could just include them all, but that seems like a waste of compute resources, since I know that what's good for one dataset will work poorly for the other.
Since ES is on the JVM, data is on disk, and every query is an HTTP request, it's quite a bit slower than the other in-memory C/C++ implementations.
Would hate to bottleneck your updates.

Here's a sneak peek for SIFT (using my own benchmarking plots):
[bokeh plot of Elastiknn's SIFT benchmark results]

So, is there some way I can set up algos.yaml to use one set of args for dataset A and another set of args for dataset B?

@erikbern
Owner

erikbern commented Sep 10, 2020

There's no way. I think it's fine to include both parameter sets, although if they are wildly different I'd be a bit nervous that they were cherry-picked. It looks like the scatter plot has a very, very large set of points, so I'd recommend pruning it down to no more than 20-50 different parameter settings.

Very excited about including elastiknn!

@alexklibisz
Contributor Author

@erikbern I'm still stuck on this and want to propose a solution and get your feedback before PRing it.

To recap: there is one parameter (the LSH hashing width w) which needs to be around 6 or 7 for good performance on the Fashion-MNIST dataset, and around 1 or 2 for the SIFT dataset. There's a pretty intuitive explanation, which I detailed here. If you set w=1 or w=2 for Fashion-MNIST, there's really no issue; the recall is just poor and you move on. But if you set w to 6 or 7 for SIFT, each query matches 50-70% of the corpus as approximate candidates, and Lucene takes an extremely long time to count up the top k approximate matches, so the run ends up timing out and wasting time/money.
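
For context, w is the bucket width in the standard p-stable L2 LSH scheme, roughly as in the sketch below (Elastiknn's actual implementation differs in the details, so treat this as an illustration only):

```python
import numpy as np

# Illustration only (not Elastiknn's actual code): the standard p-stable
# L2 LSH hash, the style of hash family in which a bucket width w appears.
# A larger w puts more vectors into the same bucket, so each query matches
# more approximate candidates.
def l2_lsh_hash(x, a, b, w):
    # a: random Gaussian projection vector, b: uniform offset in [0, w)
    return int(np.floor((np.dot(a, x) + b) / w))
```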

I've thought about ways to add an early-stopping heuristic within Elastiknn, but I hate the idea of introducing magic numbers when, IMO, the real solution is to understand the distribution of your data and how it affects parameter choice. I've also documented good parameters here.

So, my proposed solution: in the Elastiknn "algorithm" class in this repo, I'll monitor response times for queries. If the mean response time after 100 queries is > 100ms, I'll just sys.exit(0). It's a bit hacky, but it seems like the most reasonable compromise. LMK your thoughts when you get a chance.
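
Roughly something like this (class and method names are just illustrative; the real hook into the ann-benchmarks wrapper may look different):

```python
import sys
import time

# Rough sketch of the compromise described above. Names are hypothetical;
# the actual integration point in the wrapper may differ.
class LatencyGuard:
    def __init__(self, warmup_queries=100, max_mean_seconds=0.100):
        self.warmup_queries = warmup_queries
        self.max_mean_seconds = max_mean_seconds
        self.latencies = []

    def timed(self, fn, *args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        self.latencies.append(time.perf_counter() - t0)
        if len(self.latencies) == self.warmup_queries:
            mean = sum(self.latencies) / self.warmup_queries
            if mean > self.max_mean_seconds:
                # Exit cleanly rather than burning the full container timeout.
                print(f"Mean latency {mean * 1000:.0f}ms over "
                      f"{self.warmup_queries} queries; exiting early.")
                sys.exit(0)
        return result

# e.g. in the query method:
#   guard = LatencyGuard()
#   result = guard.timed(self._search, vector, k)
```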

@maumueller
Collaborator

@alexklibisz I'm wondering whether you could figure out w by sampling points during index building and basing w on the observed distances between points in the sample. There is also always the option of figuring out which dataset you are currently running on by looking at the first coordinates of the first vector in the array provided to fit. (This is of course very hacky, but it seems that this is basically what you want to know.)
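
Just to sketch the first idea (purely illustrative; the constant at the end is made up and would need calibration):

```python
import numpy as np

# Purely a sketch of the sampling idea: estimate w from observed distances
# in a small sample of the data handed to fit(). The scaling constant is a
# made-up placeholder and would need calibration against datasets where
# good values of w are already known.
def estimate_w(X, sample_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    sample = X[idx]
    # Distances between consecutive sampled points as a cheap proxy for the
    # typical inter-point distance in the dataset.
    dists = np.linalg.norm(sample[:-1] - sample[1:], axis=1)
    return float(np.median(dists)) / 100.0
```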

In general, it might improve the time to run the full benchmark quite a bit if we allowed dataset-specific settings (e.g., https://github.com/erikbern/ann-benchmarks/blob/master/algos.yaml#L545-L553 also looks fishy). For example, the dataset could be added as a third level of the hierarchy in https://github.com/erikbern/ann-benchmarks/blob/master/algos.yaml. The standard format would be something like float -> any -> any, with the option of saying float -> euclidean -> sift-128. It would only take a few lines in https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/algorithms/definitions.py#L105.
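
The lookup could then fall back from a dataset-specific entry to the generic one, roughly like this (function and key names are only illustrative, not the current definitions.py code):

```python
# Illustrative only: resolve run groups with an optional dataset-specific
# third level, falling back to "any" at each level.
def select_run_groups(algo_conf, point_type, distance, dataset):
    by_type = algo_conf[point_type]                         # e.g. "float"
    by_dist = by_type.get(distance) or by_type.get("any")   # e.g. "euclidean"
    return by_dist.get(dataset) or by_dist.get("any")       # e.g. "sift-128"
```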

I'm split here, because it opens up a window for micro-optimizing against specific datasets. (Of course, this window has always been open, simply by adding many run groups to the definition file, as above.)

@alexklibisz
Contributor Author

Hi @maumueller, thanks for the input! Generally, Elastiknn operates under the assumption that there is no explicit "fitting" or "building" phase: you can insert/update/delete vectors just as you would any other Elasticsearch document. I'm not sure a fitting/sampling step would solve this problem, since I still couldn't say "these sampled values are for dataset foo, these others are for dataset bar".

I agree it's a tough call whether to allow dataset-specific parameters. I haven't surveyed all of the models, but I'd imagine there are at least a few others with some sensitivity to the values of the data.

I think technically there's nothing wrong with having some of the containers run for two hours and then just time out. They don't blow up the whole run. However, I hate to waste @erikbern's money :).

@erikbern
Owner

I think just killing the container after 2h is fine. I might actually make it 1h. Going forward, I'm planning to run on a high-RAM machine with many algorithms in parallel, so the benchmark doesn't run for several weeks.
