Elastiknn euclidean and some logging improvements #189
Conversation
…0ms mean response
This looks great! Dumb q but what's the difference between Elastiknn and Elasticsearch?
The Elasticsearch PR I made a couple of days ago uses a datatype and query that come with stock Elasticsearch. It only supports exact/exhaustive queries, so you're pretty much limited to using it as a re-scoring step in combination with a filtering query (e.g. filter for all docs matching some keyword query, then compute vector similarity on the matched docs). It seems to be able to process roughly 120k vectors/second for the 784d MNIST vectors. Elastiknn is an Elasticsearch plugin I implemented that adds additional vector functionality, including support for sparse vectors and approximate queries using Locality Sensitive Hashing. It's similar in scope and spirit to the K-NN plugin implemented by Amazon/Open-distro.
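For context, here's a minimal sketch of the "filter, then re-score" pattern described above, using the official elasticsearch Python client. The index name, field names, and vector are hypothetical; `script_score` with the `l2norm` vector function is available in stock Elasticsearch 7.3+, though its exact signature varies across 7.x versions.

```python
# Sketch of exact kNN re-scoring in stock Elasticsearch (hypothetical names).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query_vec = [0.12, 0.85, 0.33]  # hypothetical query vector

body = {
    "query": {
        "script_score": {
            # First narrow the candidate set with an ordinary filter query...
            "query": {"term": {"category": "shoes"}},
            # ...then score only the matched docs by inverted L2 distance,
            # so smaller distances yield larger scores.
            "script": {
                "source": "1 / (1 + l2norm(params.qv, 'vec'))",
                "params": {"qv": query_vec},
            },
        }
    }
}

resp = es.search(index="my-index", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```

The exhaustive score computation only runs over docs that survive the filter, which is why this works as a re-scoring step but not as a standalone approximate search.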
Cool, interesting. Is this PR ready to be merged?
It's ready to merge as far as I'm concerned. Has Travis just checked out? I don't see any build status.
Not sure what's up with Travis. It runs on master, though. I'll see if I can fix it, but I'll trust that you ran the test suite locally for now!
Elastiknn euclidean and some logging improvements
Elastiknn for Euclidean distance
This introduces Elastiknn for the Euclidean distance datasets (another pass at #180).
The results look like this, running on a c5.4xlarge in us-east:
Fashion-MNIST: [results plot]
SIFT: [results plot]
So it's pretty slow compared to the custom C/C++/in-memory solutions, but it does complete within the current time limit.
There's a bit of hackiness: I kill any run whose throughput falls below 10 q/s after the first 100 queries. This helps prevent timeouts and wasted computation, but it also means the runner will perpetually re-run those slow jobs. We discussed this a bit over here: #178
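A minimal sketch of that early-abort heuristic (the names and structure here are hypothetical, not the actual code from this PR):

```python
# Abort a run early if throughput is too low after an initial batch of queries.
import time

MIN_THROUGHPUT = 10.0   # queries/second
CHECK_AFTER = 100       # number of queries to run before checking

def run_queries(algo, queries):
    start = time.time()
    results = []
    for i, q in enumerate(queries, start=1):
        results.append(algo.query(q))
        if i == CHECK_AFTER:
            throughput = i / (time.time() - start)
            if throughput < MIN_THROUGHPUT:
                # Kill the run now rather than let it grind into the time limit.
                raise RuntimeError(
                    f"Aborting: {throughput:.1f} q/s after {i} queries is "
                    f"below the {MIN_THROUGHPUT} q/s threshold"
                )
    return results
```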
Logging
I also made an effort to improve the logging setup, mostly to help me debug several issues with my implementation.
I'm happy to revert these parts if you prefer the original setup.
I replaced the `print`s in main.py and runner.py with info-level logging through a standard Python logger called `annb`. Now the logs emitted by the runner look like this:
I also created a logger named `annb.<container id>` that prints each container's stdout/stderr. The logs emitted from containers look like this (note the container IDs):
I configured the logging to write to stdout and to a file called `annb.log`. This made it a lot easier to identify which container failed and debug it by grepping through its logs.