Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Pickle Approximate NearestNeighbors models #4743

Open
tfeher opened this issue May 19, 2022 · 4 comments
Open

[BUG] Pickle Approximate NearestNeighbors models #4743

tfeher opened this issue May 19, 2022 · 4 comments
Labels
? - Needs Triage Need team to review and classify bug Something isn't working inactive-30d inactive-90d

Comments

@tfeher
Copy link
Contributor

tfeher commented May 19, 2022

Describe the bug

Approximate nearest neighbor models ('ivfflat', 'ivfpq') store their state is a knnIndex object. Currently there is no support to pickle models that were fitted using these algorithms. The error only shows while predicting with the loaded model.

Steps/Code to reproduce bug

import cudf
from cuml.neighbors import NearestNeighbors
from cuml.datasets import make_blobs
X, _ = make_blobs(n_samples=50, centers=5, n_features=10, random_state=42)
X_cudf = cudf.DataFrame(X)

# fit model
model = NearestNeighbors(n_neighbors=3, algorithm='ivfflat')
model.fit(X)

# pickle the model 
import pickle
pickle.dump(model, open("ann_model.pkl", "wb"))

Now start a new process (e.g. new Jupyter kernel)

import cudf
from cuml.neighbors import NearestNeighbors
from cuml.datasets import make_blobs
X, _ = make_blobs(n_samples=50, centers=5, n_features=10, random_state=42)
X_cudf = cudf.DataFrame(X)

import pickle
model_loaded = pickle.load(open("knn_model.pkl", "rb"))

distances2, indices2 = model_loaded.kneighbors(X_cudf)

This will result in the process dying. This is probably due to accessing the model state through knnIndex pointer, which was just saved/restored as int values, but does not point to a valid object if the process is restarted. (One can see this by observing the 'knn_index' value in the dict returned by model.__getstate__()).

Expected behavior
Pickling and loading the model shall work. To achieve this ANN models need to serialize / deserialize their knnIndex object while pickling the model.

Environment details (please complete the following information):

  • Tested using 22.04 conda packages.
@viclafargue
Copy link
Contributor

Thank you for spotting that. Unfortunately, it looks like there is no simple solution for this right now. Indeed the knnIndex struct contains GPU resources handled by FAISS. However, if we develop our own ANN algorithms it might become easier to serialize the necessary data though.

@tfeher
Copy link
Contributor Author

tfeher commented Jun 3, 2022

Yes, I agree. We can return to this question after rapidsai/raft#652

@github-actions
Copy link

github-actions bot commented Jul 3, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

github-actions bot commented Oct 1, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapids-bot bot pushed a commit to rapidsai/raft that referenced this issue Jan 3, 2023
This PR implements serialization to file for  `ivf_pq::index` and `ivf_flat::index` structures.

Index building takes time, therefore downstream projects (like cuML) want to save the index (rapidsai/cuml#4743). But downstream project should not depend on the implementation details of the index, therefore RAFT provides methods to serialize and deserialize the index.

This is still experimental:
- ideally we want to use a general serialization method for mdspan #770,
- instead of directly saving to file, raft should provide a byte string and let the downstream project decide how to save it (e.g. pickle for cuML).

Python wrappers are provided for IVF-PQ to save/load the index.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #919
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
? - Needs Triage Need team to review and classify bug Something isn't working inactive-30d inactive-90d
Projects
None yet
Development

No branches or pull requests

2 participants