Elasticsearch #174

Closed
erikbern opened this issue Aug 9, 2020 · 12 comments

Comments

@erikbern
Owner

erikbern commented Aug 9, 2020

Would be interesting to add: https://opendistro.github.io/for-elasticsearch/features/knn.html

@igorbrigadir

I'd love to have the numbers for this too - docs say it uses HNSW from NMSLIB so it would be similar to that, but maybe there are some overheads that may lead to differences in performance.

@erikbern
Owner Author

I believe this is being implemented under #180

@alexklibisz
Contributor

So far I've implemented ES-based nearest-neighbors for the stock vector functionality that comes with X-Pack (#186) and for my own vector search plugin: https://github.com/alexklibisz/elastiknn (#189).

I'm hoping to also find some time to implement it using Amazon's open-distro plugin, which was linked in the original issue comment above.

@stephenleo
Contributor

I'm using opendistro at work and familiar with the KNN plugin. I have it working with ann-benchmarks but still hitting random timeouts that I'm troubleshooting. @alexklibisz , do you mind if I fix and push this one?

@alexklibisz
Contributor

> I'm using opendistro at work and familiar with the KNN plugin. I have it working with ann-benchmarks but still hitting random timeouts that I'm troubleshooting. @alexklibisz , do you mind if I fix and push this one?

Go for it! I haven't had the time to even start on it yet. Also, obviously feel free to borrow from the Elastiknn and Elasticsearch docker images and algos, and post questions about your timeouts. I found some things very tricky to set up with Elasticsearch.

@stephenleo
Contributor

Oh yes your work on other elastic images helped tremendously. I'm mainly facing timeouts during refresh. I increased it to 100 from default 10 and it still fails on some runs. Wondering if I should increase it further or find some other way to handle it. I hope to update my fork over this weekend so that I can share the code for a clearer picture. Thank you!

@alexklibisz
Contributor

> Oh yes your work on other elastic images helped tremendously. I'm mainly facing timeouts during refresh. I increased it to 100 from default 10 and it still fails on some runs. Wondering if I should increase it further or find some other way to handle it. I hope to update my fork over this weekend so that I can share the code for a clearer picture. Thank you!

Good to hear. 10sec definitely seems way too low. With regular ES under the hood, I wouldn't be surprised if refreshing and merging 1M docs into a single segment takes 2-3 minutes. You also have to factor in that you're using an HNSW binary under the hood, so maybe you can get an idea of reasonable times by fitting some HNSW models without ES in the loop.
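
For the refresh timeouts discussed above, one alternative to picking a single larger value is to retry with progressively longer timeouts and record which one finally worked. This is only a minimal sketch: `refresh_with_escalating_timeout`, the `refresh` callable, and the timeout values are hypothetical names chosen for illustration, not code from ann-benchmarks or any Elasticsearch client. It assumes the caller wraps the real refresh call in a function that takes a timeout in seconds, and passes in whichever exception type their client raises on timeout.

```python
import time


def refresh_with_escalating_timeout(refresh, timeouts=(10, 100, 300),
                                    retry_on=(TimeoutError,)):
    """Call refresh(timeout_seconds), retrying with longer timeouts.

    `refresh` is a hypothetical callable supplied by the caller; it should
    raise one of the exception types in `retry_on` when the timeout is hit.
    Returns (result, timeout_that_worked, elapsed_seconds_of_that_attempt).
    """
    last_error = None
    for timeout in timeouts:
        start = time.monotonic()
        try:
            result = refresh(timeout)
            return result, timeout, time.monotonic() - start
        except retry_on as err:
            # Remember the failure and move on to the next, longer timeout.
            last_error = err
    raise TimeoutError(
        f"refresh still timing out after {timeouts[-1]}s"
    ) from last_error
```

Logging the timeout that eventually succeeds gives a rough idea of how long refresh and merge really take on a given dataset, which can then be compared against fitting an HNSW model outside of Elasticsearch.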

erikbern closed this as completed Dec 9, 2020

@erikbern
Owner Author

erikbern commented Dec 9, 2020

closing this for now since it's added right?

@alexklibisz
Contributor

> closing this for now since it's added right?

I guess it depends on how granular you want to make the issues. Right now there are three ways to do KNN on Elasticsearch:

  1. Using built-in functionality for exact/exhaustive KNN.
  2. Using Elastiknn for ANN
  3. Using Amazon opendistro for ANN

So far 1 and 2 are implemented and merged. @stephenleo is working on 3.
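
To make the difference between options 1 and 3 concrete, here is a rough sketch of what the two search request bodies look like. The helper names `exact_knn_query` and `opendistro_knn_query` are hypothetical, and the query shapes are written from the Elasticsearch `script_score` and Open Distro k-NN documentation as I understand them; the exact syntax is version-dependent, so treat this as an assumption to check against the docs for your version. Elastiknn (option 2) has its own query type and is left out here.

```python
def exact_knn_query(field, vector, k):
    """Option 1: exhaustive KNN by scoring every document with a script.

    Assumes `field` is mapped as a dense_vector and a version of
    Elasticsearch that accepts the field name as a string argument.
    """
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 keeps scores non-negative, as script_score requires.
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": list(vector)},
                },
            }
        },
    }


def opendistro_knn_query(field, vector, k):
    """Option 3: approximate KNN via the Open Distro k-NN plugin.

    Assumes `field` is mapped as a knn_vector in an index with k-NN enabled.
    """
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }
```

Either dictionary would be sent as the body of a search request. The first scores every document in the index, while the second delegates to the plugin's HNSW index.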

@erikbern
Owner Author

erikbern commented Dec 9, 2020

Got it – I think it's good enough for now? Is there a huge difference between 3 and 1/2?

@alexklibisz
Contributor

> Got it – I think it's good enough for now? Is there a huge difference between 3 and 1/2?

I would not be surprised if there actually is a pretty substantial difference. 1 and 2 use the JVM exclusively, which is pretty darn slow for CPU-bound number crunching. 3 uses the C/C++ HNSW binary under the hood, which is an extra operational consideration, but if they implemented it well it should be clearly faster.

@erikbern
Owner Author

erikbern commented Dec 9, 2020

oh ok, interesting. looking forward to any PR for option 3!

erikbern added a commit that referenced this issue Dec 15, 2020
Adds Open Distro Elastic Search's KNN plugin support. Closes #174.
erikbern added a commit that referenced this issue Apr 14, 2023
Adds Open Distro Elastic Search's KNN plugin support. Closes #174.