large dataset #81

sctrueew · 2018-12-16T15:40:50Z

Hello everyone,
I've extracted the features of 100M images and each image is an array of 4096.I have a machine with 128Gb Ram and I want to know that
What is the best parameters should I use?
Should I split into multiple indexes?
Can I use this method for large scale?

yurymalkov · 2018-12-17T15:40:20Z

Hi @zpmmehrdad,
It seems the dataset is to large fit into the memory (100M4096sizeof(float))~1.5Tb. HNSW index without the data requires much less memory.
You can try to compress your data (you can look at https://github.com/facebookresearch/faiss or https://github.com/dbaranchuk/ivf-hnsw). Compressing would help if the real dimensionality of the data is not that big - which is very likely to be the case.
Otherwise, you can use splitting.

sctrueew · 2018-12-17T16:22:20Z

Hi @yurymalkov,
Thank you for response. If I split my dataset into 10 indexes or 100 indexes, how much RAM is needed for each index?
I've used Annoy before for 10M images that used less 1Gb RAM.

yurymalkov · 2018-12-17T16:44:11Z

@zpmmehrdad Did you use offline building with annoy?
If online, annoy also stores the data, so it should take 150 Gb, not 1 Gb. If offline, it should be extremely slow.
I think I am missing something.

sctrueew · 2018-12-17T17:41:42Z

@yurymalkov I've made 10 indexes and I search parallel on all indexes and finally merge them.for this I want to use hnswlib because it's very faster and accurate than Annoy.
You think what should I do for 100M?
Can I use PCA for compression?

searchivarius · 2018-12-17T17:47:30Z

I think 4096 is too much. Very likely you can get the same accuracy after reducing dimensionality with an L2-autoencoder or PCA. PCA is simpler, L2-autoencoder is potentially more accurate. Kind regards, Leo (Leonid) Boytsov

…

On Mon, Dec 17, 2018 at 12:41 PM mehrdad mazhari ***@***.***> wrote: @yurymalkov <https://github.com/yurymalkov> I've made 10 indexes and I search parallel on all indexes and finally merge them.for this I want to use hnswlib because it's very faster and accurate than Annoy. You think what should I do for 100M? Can I use PCA for compression? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#81 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAyZMrB_VGm3gy43B3APDy7uwFqGN_R0ks5u59dXgaJpZM4ZVTB6> .

sctrueew · 2018-12-17T18:08:26Z

Hi @searchivarius, thanks for reply
Can you give an example for L2-autoencoder or PCA in python?

searchivarius · 2018-12-17T18:49:12Z

Well, you need to use one of the frameworks like tensorflow or pytorch. Then, a simplest thing to do is to have a narrowing fully-connected layer and widening layer. The loss would be an L2-reconstruction loss. Of course, there are details (which non-linearities to use as activations), but there are plenty of examples to google for. For PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html Crucially, you "fit" the PCA model on some reasonably small subset and then apply it to a whole data set. Kind regards, Leo (Leonid) Boytsov

…

On Mon, Dec 17, 2018 at 1:08 PM mehrdad mazhari ***@***.***> wrote: Hi @searchivarius <https://github.com/searchivarius>, thanks for reply Can you example for L2-autoencoder or PCA in python? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#81 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAyZMtEqEgI7dI4PlIC6PTDYO_i9X-7fks5u592bgaJpZM4ZVTB6> .

sctrueew · 2018-12-17T19:28:12Z

Thanks a lot @searchivarius

sctrueew · 2018-12-18T08:41:03Z

Hi @searchivarius ,
For example, I've tested100 vectors (each vector is 4096x1) and I run the code:


import random
import hnswlib
import numpy as np
from sklearn.decomposition import PCA

data = np.float32(np.random.random((100, 4096)))
pca = PCA(n_components=0.99)
new_arr = pca.fit_transform(data)
dim = new_arr.shape[1]
p = hnswlib.Index(space='l2', dim=dim)
p.set_ef(10)
p.set_num_threads(8)
p.add_items(new_arr)
p.save_index("test.bin")
labels, distances = p.knn_query(new_arr[0], k=10)

#it has no problem but when I want to search by new query 
query1= np.float32(np.random.random((1, 4096)))
pca = PCA(n_components=0.99)
new_query= pca.fit_transform(query1)
labels, distances = p.knn_query(new_query, k=10)

PCA doesn't work on a single vector
What should I do?

searchivarius · 2018-12-18T14:03:50Z

Don't call fit_transform the second time. You do fit or fit_transform only once on the *training set*. This is when you learn the model. Then you just *apply* the model by calling transform: https://scikit-learn.org/stable/data_transforms.html Kind regards, Leo (Leonid) Boytsov

…

On Tue, Dec 18, 2018 at 3:41 AM mehrdad mazhari ***@***.***> wrote: Hi @searchivarius <https://github.com/searchivarius> , For example, I've tested100 vectors (each vector is 4096x1) and I run the code: import random import hnswlib import numpy as np from sklearn.decomposition import PCA data = np.float32(np.random.random((100, 4096))) pca = PCA(n_components=0.99) new_arr = pca.fit_transform(data) dim = new_arr.shape[1] p = hnswlib.Index(space='l2', dim=dim) p.set_ef(10) p.set_num_threads(8) p.add_items(new_arr) p.save_index("test.bin") labels, distances = p.knn_query(new_arr[0], k=10) it has no problem but when I want to search by new query query1= np.float32(np.random.random((1, 4096))) pca = PCA(n_components=0.99) new_query= pca.fit_transform(query1) labels, distances = p.knn_query(new_query, k=10) PCA doesn't work on a single vector What should I do? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#81 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAyZMpi_ZpT1sFnPFpwdWA3r-Po9-Bktks5u6KoggaJpZM4ZVTB6> .

sctrueew · 2018-12-19T05:48:56Z

@searchivarius I didn't call fit_transform the second time

new_query= np.float32(np.random.random((1, 4096)))
labels, distances = p.knn_query(new_query, k=10)

But unfortunately the results are wrong.

searchivarius · 2018-12-19T06:50:50Z

new_query= *pca.fit_transform(query1)* labels, distances = p.knn_query(new_query, k=10) PCA doesn't work on a single vector What should I do?

…

On Wed, Dec 19, 2018, 12:48 AM mehrdad mazhari ***@***.*** wrote: @searchivarius <https://github.com/searchivarius> I didn't call fit_transform the second time new_query= np.float32(np.random.random((1, 4096))) labels, distances = p.knn_query(new_query, k=10) But unfortunately the results are wrong. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#81 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAyZMk6urn2a7S2sd8cquoLSVTf7soDgks5u6dNJgaJpZM4ZVTB6> .

searchivarius · 2018-12-19T06:51:43Z

Also random 4096 dim data isn't amenable to knn search.

…

On Wed, Dec 19, 2018, 1:50 AM Leo Boytsov ***@***.*** wrote: new_query= *pca.fit_transform(query1)* labels, distances = p.knn_query(new_query, k=10) PCA doesn't work on a single vector What should I do? On Wed, Dec 19, 2018, 12:48 AM mehrdad mazhari ***@***.*** wrote: > @searchivarius <https://github.com/searchivarius> I didn't call > fit_transform the second time > > new_query= np.float32(np.random.random((1, 4096))) > labels, distances = p.knn_query(new_query, k=10) > > But unfortunately the results are wrong. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#81 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAyZMk6urn2a7S2sd8cquoLSVTf7soDgks5u6dNJgaJpZM4ZVTB6> > . >

sctrueew · 2018-12-19T07:07:05Z

execue me @searchivarius, could you please an example? I quite confused
thanks.

yurymalkov · 2018-12-19T09:05:20Z

@zpmmehrdad You should try the same (e.g. train on the randomly selected portion of the dataset, transform the query using trained PCA), but on real data. The results are wrong because PCA does not work on random data - it can help a lot in case the data is correlated between the dimensions (this is true for many real datasets).
You should be able to see how the accuracy of the search degrades with decreasing n_components from 4096 to 1. When n_components=4096 there should be no change in accuracy.
You can also try quantization approaches (e.g. the ones in faiss), but they are harder to handle and tune.

preetim96 · 2019-03-06T11:26:57Z

Hello @yurymalkov and @searchivarius
What changes I have to do so that the index is created on ssd disk and searching is also performed in Disk Index ?

yurymalkov · 2019-03-06T11:50:10Z

@preetim96 Probably, instead of allocating the chunk of the memory you would need to memmap it to the SSD.

preetim96 · 2019-03-06T15:16:51Z

Hello @yurymalkov Thanks for reply
Can you provide a code snippet to do this?

pawanm09 · 2019-03-28T14:39:25Z

@yurymalkov @searchivarius
When we memmap the index to the disk. Then for searching do we need to load the index into RAM again?
Or we can perform the search in disk index itself.

sctrueew changed the title ~~Build a large dataset~~ large dataset Dec 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

large dataset #81

large dataset #81

sctrueew commented Dec 16, 2018 •

edited

Loading

yurymalkov commented Dec 17, 2018

sctrueew commented Dec 17, 2018

yurymalkov commented Dec 17, 2018

sctrueew commented Dec 17, 2018

searchivarius commented Dec 17, 2018 via email

sctrueew commented Dec 17, 2018 •

edited

Loading

searchivarius commented Dec 17, 2018 via email

sctrueew commented Dec 17, 2018

sctrueew commented Dec 18, 2018 •

edited

Loading

searchivarius commented Dec 18, 2018 via email

sctrueew commented Dec 19, 2018

searchivarius commented Dec 19, 2018 via email

searchivarius commented Dec 19, 2018 via email

sctrueew commented Dec 19, 2018

yurymalkov commented Dec 19, 2018

preetim96 commented Mar 6, 2019 •

edited

Loading

yurymalkov commented Mar 6, 2019

preetim96 commented Mar 6, 2019

pawanm09 commented Mar 28, 2019

large dataset #81

large dataset #81

Comments

sctrueew commented Dec 16, 2018 • edited Loading

yurymalkov commented Dec 17, 2018

sctrueew commented Dec 17, 2018

yurymalkov commented Dec 17, 2018

sctrueew commented Dec 17, 2018

searchivarius commented Dec 17, 2018 via email

sctrueew commented Dec 17, 2018 • edited Loading

searchivarius commented Dec 17, 2018 via email

sctrueew commented Dec 17, 2018

sctrueew commented Dec 18, 2018 • edited Loading

searchivarius commented Dec 18, 2018 via email

sctrueew commented Dec 19, 2018

searchivarius commented Dec 19, 2018 via email

searchivarius commented Dec 19, 2018 via email

sctrueew commented Dec 19, 2018

yurymalkov commented Dec 19, 2018

preetim96 commented Mar 6, 2019 • edited Loading

yurymalkov commented Mar 6, 2019

preetim96 commented Mar 6, 2019

pawanm09 commented Mar 28, 2019

sctrueew commented Dec 16, 2018 •

edited

Loading

sctrueew commented Dec 17, 2018 •

edited

Loading

sctrueew commented Dec 18, 2018 •

edited

Loading

preetim96 commented Mar 6, 2019 •

edited

Loading