Elastiknn euclidean and some logging improvements #189

Merged 21 commits on Nov 12, 2020

alexklibisz
Contributor

Elastiknn for Euclidean distance

This introduces Elastiknn for the Euclidean distance datasets (another pass at #180).

The results look like this, running on a c5.4xlarge in us-east:

Fashion-MNIST:

[image: results plot]

SIFT:

[image: results plot]

So it's pretty slow compared to the custom C/C++/in-memory solutions, but it does complete within the current time limit.

There's a bit of hackiness in that I kill any run which has throughput < 10 q/s after 100 queries. This helps prevent timeouts and wasteful computation, but it also means the runner will perpetually re-run those slow jobs. We discussed it a bit over here: #178
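
A minimal sketch of that kind of throughput guard, not the code from this PR. The names `run_with_throughput_guard` and `SlowRunError` are hypothetical, and it assumes the check happens once, at the 100-query mark:

```python
import time


class SlowRunError(RuntimeError):
    """Raised when a run's throughput falls below the allowed minimum."""


def run_with_throughput_guard(queries, query_fn, min_qps=10.0, check_after=100,
                              clock=time.perf_counter):
    """Run query_fn on each query; abort the run if it is too slow.

    Assumption: throughput is checked once, after `check_after` queries.
    """
    results = []
    start = clock()
    for done, query in enumerate(queries, start=1):
        results.append(query_fn(query))
        if done == check_after:
            elapsed = clock() - start
            qps = done / elapsed if elapsed > 0 else float("inf")
            if qps < min_qps:
                raise SlowRunError(
                    f"{qps:.1f} q/s after {done} queries (minimum {min_qps} q/s)"
                )
    return results
```

Because an aborted run stores no result, a runner that only skips completed runs would retry it on the next invocation, which is the perpetual re-run behaviour mentioned above.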


Logging

I also made an effort to improve the logging setup, mostly to help me debug several issues with my implementation.
I'm happy to revert these parts if you prefer the original setup.

I replaced the prints in main.py and runner.py with info-level logging through a standard Python logger called annb.
Now the logs emitted by the runner look like this:

2020-11-07 03:37:03,642 - annb.13085c041b - INFO - Created container 13085c041b: CPU limit 2, mem limit 10403692117, timeout 7200, command ['--dataset', 'sift-128-euclidean', '--algorithm', 'elastiknn-l2lsh', '--module', 'ann_benchmarks.algorithms.elastiknn', '--constructor', 'L2Lsh', '--runs', '3', '--count', '100', '[100, 4, 3]', '[1000, 0]', '[1000, 6]', '[10000, 0]', '[10000, 6]']

I also created a logger named annb.<container id> that prints each container's stdout/stderr.
And the logs emitted from containers look like this (note the container IDs):

2020-11-07 05:56:43,099 - annb.6857f2c833 - INFO - Run 2/3...
2020-11-07 05:56:53,507 - annb.30d74809df - INFO - Processed 1000/10000 queries...
2020-11-07 05:57:01,952 - annb.ead6762f8f - INFO - Processed 6000/10000 queries...
2020-11-07 05:57:04,600 - annb.30d74809df - INFO - Processed 2000/10000 queries...
2020-11-07 05:57:15,868 - annb.30d74809df - INFO - Processed 3000/10000 queries...
2020-11-07 05:57:27,103 - annb.30d74809df - INFO - Processed 4000/10000 queries...

I configured the logging to write to stdout and to a file called annb.log.
This made it a lot easier to identify which container failed and debug it by grepping through its logs.
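
A rough sketch of a setup along these lines, using only the standard `logging` module. The function names `configure_annb_logging` and `container_logger` are hypothetical, and the format string is an assumption chosen to match the log lines shown above rather than the PR's actual configuration:

```python
import logging
import sys

# Assumed format; it reproduces lines like
# "2020-11-07 05:56:43,099 - annb.6857f2c833 - INFO - Run 2/3..."
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def configure_annb_logging(log_path="annb.log"):
    """Attach stdout and file handlers to the 'annb' logger (safe to call twice)."""
    logger = logging.getLogger("annb")
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        formatter = logging.Formatter(LOG_FORMAT)
        handlers = (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path))
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger


def container_logger(container_id):
    """Child logger named annb.<container id>; records propagate to the 'annb' handlers."""
    return logging.getLogger(f"annb.{container_id}")
```

With this layout, something like `grep annb.30d74809df annb.log` pulls out the lines for a single container.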

alexklibisz changed the title from "Elastiknn euclidean" to "Elastiknn euclidean and some logging improvements" on Nov 7, 2020
alexklibisz mentioned this pull request on Nov 7, 2020
@erikbern
Owner

erikbern commented Nov 8, 2020

This looks great! Dumb q but what's the difference between Elastiknn and Elasticsearch?

@alexklibisz
Contributor Author

> This looks great! Dumb q but what's the difference between Elastiknn and Elasticsearch?

The Elasticsearch PR I made a couple of days ago uses a datatype and query that come with stock Elasticsearch. It only supports exact/exhaustive queries, so you're pretty much limited to using it as a re-scoring step in combination with a filtering query (e.g. filter for all docs matching some keyword query, and then compute vector similarity on the matched docs). It seems to be able to process about 120k vectors/second for the 784d MNIST vectors.
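
To illustrate the filter-then-rescore pattern, here is a small in-memory sketch. It is not the Elasticsearch query DSL; the document layout (`text` and `vector` keys) and the function names are made up for the example:

```python
import math


def l2_distance(a, b):
    """Exact Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_then_rescore(docs, keyword, query_vector, k=10):
    """Keep docs whose text contains `keyword`, then rank them by exact distance."""
    matched = [d for d in docs if keyword in d["text"]]
    matched.sort(key=lambda d: l2_distance(d["vector"], query_vector))
    return matched[:k]
```

The exhaustive distance computation only touches the docs that survive the filter, which is why the exact query is practical as a second stage but not as a search over the whole index.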

Elastiknn is an Elasticsearch plugin that I implemented that adds additional vector functionality, including support for sparse vectors and approximate queries using Locality Sensitive Hashing. It's similar in scope and spirit to the K-NN plugin implemented by Amazon/Open-distro.
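
For readers new to LSH, below is a sketch of the textbook random-projection hash for Euclidean distance, h(v) = floor((a·v + b) / w). It is a generic illustration, not Elastiknn's implementation; the class name `L2Hash` and its parameters are assumptions made for the example:

```python
import math
import random


class L2Hash:
    """Concatenation of `num_hashes` random-projection hashes for L2 distance."""

    def __init__(self, dims, num_hashes, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One Gaussian projection vector and one uniform offset per hash.
        self.projections = [
            [rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(num_hashes)
        ]
        self.offsets = [rng.uniform(0.0, width) for _ in range(num_hashes)]

    def __call__(self, vector):
        # Project onto each random direction, shift, and bucket by width.
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, vector)) + b) / self.width)
            for proj, b in zip(self.projections, self.offsets)
        )
```

Nearby vectors tend to land in the same buckets, so an approximate query only needs to score candidates that share hash values with the query vector.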

@erikbern
Owner

cool, interesting

is this PR ready to be merged?

@alexklibisz
Contributor Author

It's ready to merge as far as I'm concerned. Has Travis just checked out? I don't see any build status.

@erikbern
Owner

not sure what's up with travis. It runs on master though. Will see if I can fix, but I'll trust that you ran the test suite locally for now!

erikbern merged commit cc0ac15 into erikbern:master on Nov 12, 2020
alexklibisz deleted the elastiknn-euclidean branch on March 27, 2022
erikbern added a commit that referenced this pull request Apr 14, 2023
Elastiknn euclidean and some logging improvements