
topic_model.transform() breaks the kernel and I have to restart the whole notebook again #1092

Closed
hrishbhdalal opened this issue Mar 14, 2023 · 1 comment

Comments

@hrishbhdalal

After training the model with the RAPIDS-based UMAP and HDBSCAN and saving it to disk, when I reload the model and call .transform(), it crashes my kernel and I have to run the entire notebook again. The strange thing is that this does not happen when I use the model right after it is trained; it only happens after I have saved the model to the local disk and reloaded it. I initially thought it was a memory issue, but inferencing on even a single document also triggers the crash and ruins all of my progress.

My system: Ubuntu 20.04
RAPIDS 22.12, Python 3.9.15
BERTopic 0.13
GPU: RTX 3090 Ti

I am training and inferencing on ~2 million documents created from tweets. If I do not use RAPIDS, everything works fine, but it breaks when I use RAPIDS.

The code looks like this. I am not showing the embedding creation, as that code is very long: I am using a custom HFTransformerBackend for one of the BERT models optimized for tweets.

from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# GPU-accelerated dimensionality reduction and clustering from RAPIDS cuML
umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=2, gen_min_span_tree=True, prediction_data=True)

# CountVectorizer to handle the custom naming and topic representations
german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(stop_words=german_stop_words, ngram_range=(1, 2), max_features=20000)

topic_model = BERTopic(
    embedding_model=HFTransformerBackend,  # custom backend for the tweet-optimized BERT model (creation not shown)
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    diversity=0.2,
    nr_topics=75,
    # calculate_probabilities=True
    # min_topic_size=int(0.001*len(docs))
)
topics, probs = topic_model.fit_transform(docs)

I think the main culprit here is the HDBSCAN model, as this is the step where the GPU maxes out at 100% and then breaks. Please help; I have already wasted a couple of days trying to figure this out.

@beckernick
Contributor

beckernick commented Mar 15, 2023

If you were able to fit the model but this is happening during HDBSCAN's use in model.transform, you may be running out of memory during the call to approximate_predict. This should have a fairly low memory requirement, but it may still need more than the available unused memory on your GPU, depending on what else is resident there. For example, it's possible that your DL model is reserving some GPU memory that cuML isn't able to access (creating an artificial out-of-memory error). If this is happening and you're able to use PyTorch 2 and RAPIDS 23.02, you may be able to avoid this problem by using a single memory allocator for both PyTorch and cuML (or by running transform in stages).
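A minimal sketch of the shared-allocator approach, assuming PyTorch 2.0+ and RMM 23.02+ (where rmm.allocators.torch is available); this should be run before PyTorch makes any CUDA allocations:

import torch
import rmm
from rmm.allocators.torch import rmm_torch_allocator

# Create an RMM memory pool and route PyTorch's CUDA allocations through it,
# so PyTorch and cuML draw from the same pool instead of each allocator
# holding memory the other cannot see.
rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

Running transform in stages would simply mean splitting the documents into chunks and calling topic_model.transform on each chunk, so that no single call needs to hold all intermediate results on the GPU at once.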

If this happens during fit when calculate_probabilities=True, the problem is likely that your HDBSCAN model is finding many clusters with that set of parameters at that dataset size. This cuML issue will hopefully provide a workaround. We're actively working to reduce the memory requirements for HDBSCAN.

If you don't need calculate_probabilities=True, you may be able to use cuML's UMAP while keeping HDBSCAN on the CPU. This blog illustrates where time is spent in an example BERTopic workflow, and when calculate_probabilities=False it's often not critical to run HDBSCAN on the GPU (in comparison to the embeddings and UMAP steps).
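A minimal sketch of that mixed setup (GPU UMAP, CPU HDBSCAN), reusing the parameters from the report above and assuming docs is your list of documents:

from cuml.manifold import UMAP   # GPU-accelerated UMAP from RAPIDS cuML
from hdbscan import HDBSCAN      # CPU HDBSCAN
from bertopic import BERTopic

umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=2, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

With this split, the expensive embedding and UMAP steps still run on the GPU, while clustering and approximate_predict stay in host memory and avoid competing with the DL model for GPU memory.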
