Elastiknn euclidean and some logging improvements #189
Conversation
…0ms mean response
This looks great! Dumb q but what's the difference between Elastiknn and Elasticsearch?
The Elasticsearch PR I made a couple of days ago uses a datatype and query that come with stock Elasticsearch. It only supports exact/exhaustive queries, so you're pretty much limited to using it as a re-scoring step in combination with a filtering query (e.g. filter for all docs matching some keyword query, then compute vector similarity on the matched docs). It seems to be able to process roughly 120k vectors/second for the 784d MNIST vectors. Elastiknn is an Elasticsearch plugin I implemented that adds additional vector functionality, including support for sparse vectors and approximate queries using Locality Sensitive Hashing. It's similar in scope and spirit to the K-NN plugin implemented by Amazon/Open-distro.
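For context, here's a minimal sketch of the "filter, then re-score" pattern described above, using the official elasticsearch Python client. The index name, field names, and vector are hypothetical; `script_score` with the `l2norm` vector function is available in stock Elasticsearch 7.3+, though its exact signature varies across 7.x versions.

```python
# Sketch of exact kNN re-scoring in stock Elasticsearch (hypothetical names).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query_vec = [0.12, 0.85, 0.33]  # hypothetical query vector

body = {
    "query": {
        "script_score": {
            # First narrow the candidate set with an ordinary filter query...
            "query": {"term": {"category": "shoes"}},
            # ...then score only the matched docs by inverted L2 distance,
            # so smaller distances yield larger scores.
            "script": {
                "source": "1 / (1 + l2norm(params.qv, 'vec'))",
                "params": {"qv": query_vec},
            },
        }
    }
}

resp = es.search(index="my-index", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```

The exhaustive score computation only runs over docs that survive the filter, which is why this works as a re-scoring step but not as a standalone approximate search.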
Cool, interesting. Is this PR ready to be merged?
It's ready to merge as far as I'm concerned. Has Travis just checked out? I don't see any build status.
Not sure what's up with Travis. It runs on master, though. I'll see if I can fix it, but I'll trust that you ran the test suite locally for now!
Elastiknn euclidean and some logging improvements
Elastiknn for Euclidean distance
This introduces Elastiknn for the Euclidean distance datasets (another pass at #180).
The results look like this, running on a c5.4xlarge in us-east:
Fashion-MNIST: [results plot]
SIFT: [results plot]
So it's pretty slow compared to the custom C/C++/in-memory solutions, but it does complete within the current time limit.
There's a bit of hackiness: I kill any run whose throughput falls below 10 q/s after the first 100 queries. This helps prevent timeouts and wasted computation, but it also means the runner will perpetually re-run those slow jobs. We discussed this a bit over here: #178
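A minimal sketch of that early-abort heuristic (the names and structure here are hypothetical, not the actual code from this PR):

```python
# Abort a run early if throughput is too low after an initial batch of queries.
import time

MIN_THROUGHPUT = 10.0   # queries/second
CHECK_AFTER = 100       # number of queries to run before checking

def run_queries(algo, queries):
    start = time.time()
    results = []
    for i, q in enumerate(queries, start=1):
        results.append(algo.query(q))
        if i == CHECK_AFTER:
            throughput = i / (time.time() - start)
            if throughput < MIN_THROUGHPUT:
                # Kill the run now rather than let it grind into the time limit.
                raise RuntimeError(
                    f"Aborting: {throughput:.1f} q/s after {i} queries is "
                    f"below the {MIN_THROUGHPUT} q/s threshold"
                )
    return results
```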
Logging
I also made an effort to improve the logging setup, mostly to help me debug several issues with my implementation.
I'm happy to revert these parts if you prefer the original setup.
I replaced the `print`s in main.py and runner.py with info-level logging through a standard Python logger called `annb`. Now the logs emitted by the runner look like this:
I also created a logger named `annb.<container id>` that prints each container's stdout/stderr. The logs emitted from containers look like this (note the container IDs):
I configured the logging to write to stdout and to a file called `annb.log`. This made it a lot easier to identify which container failed and debug it by grepping through its logs.