
topic_model.transform() breaks the kernel and I have to restart the whole notebook again #1092

Closed
hrishbhdalal opened this issue Mar 14, 2023 · 1 comment

Comments

@hrishbhdalal

After training the model with the RAPIDS-based UMAP and HDBSCAN and saving it to disk, when I reload the model and call .transform(), it crashes my kernel and I have to run the entire notebook again. The strange thing is that this does not happen when I use the model right after it is trained; it only happens after I have saved the model to the local disk and reloaded it. I initially thought it was a memory issue, but inferencing on even a single document also triggers the crash and ruins all of my progress.

My system: Ubuntu 20.04
RAPIDS 22.12, Python 3.9.15
BERTopic 0.13
GPU: RTX 3090 Ti

I am training and inferencing on ~2 million documents created from tweets. If I do not use RAPIDS, everything works fine, but it breaks when I use RAPIDS.

The code looks like this. I am not showing the embedding creation, as that code is very long: I am using a custom HFTransformerBackend for one of the BERT models optimized for tweets.

from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# GPU-accelerated dimensionality reduction and clustering from RAPIDS cuML
umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=2, gen_min_span_tree=True, prediction_data=True)

# CountVectorizer to handle the custom naming and topic representations
german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(stop_words=german_stop_words, ngram_range=(1, 2), max_features=20000)

topic_model = BERTopic(
    embedding_model=HFTransformerBackend,  # custom backend for the tweet-optimized BERT model (creation not shown)
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    diversity=0.2,
    nr_topics=75,
    # calculate_probabilities=True
    # min_topic_size=int(0.001*len(docs))
)
topics, probs = topic_model.fit_transform(docs)

I think the main culprit here is the HDBSCAN model, as this is the step where the GPU maxes out at 100% and then breaks. Please help; I have already wasted a couple of days trying to figure this out.

@beckernick
Contributor

beckernick commented Mar 15, 2023

If you were able to fit the model but this is happening during HDBSCAN's use in model.transform, you may be running out of memory during the call to approximate_predict. This should have a fairly low memory requirement, but it may still need more than the available unused memory on your GPU, depending on what else is resident there. For example, it's possible that your DL model is reserving some GPU memory that cuML isn't able to access (creating an artificial out-of-memory error). If this is happening and you're able to use PyTorch 2 and RAPIDS 23.02, you may be able to avoid this problem by using a single memory allocator for both PyTorch and cuML (or by running transform in stages).
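A minimal sketch of the shared-allocator approach, assuming PyTorch 2.0+ and RMM 23.02+ (where rmm.allocators.torch is available); this should be run before PyTorch makes any CUDA allocations:

import torch
import rmm
from rmm.allocators.torch import rmm_torch_allocator

# Create an RMM memory pool and route PyTorch's CUDA allocations through it,
# so PyTorch and cuML draw from the same pool instead of each allocator
# holding memory the other cannot see.
rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

Running transform in stages would simply mean splitting the documents into chunks and calling topic_model.transform on each chunk, so that no single call needs to hold all intermediate results on the GPU at once.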

If this happens during fit when calculate_probabilities=True, the problem is likely that your HDBSCAN model is finding many clusters with that set of parameters at that dataset size. This cuML issue will hopefully provide a workaround. We're actively working to reduce the memory requirements for HDBSCAN.

If you don't need calculate_probabilities=True, you may be able to use cuML's UMAP while keeping HDBSCAN on the CPU. This blog illustrates where time is spent in an example BERTopic workflow, and when calculate_probabilities=False it's often not critical to run HDBSCAN on the GPU (in comparison to the embeddings and UMAP steps).
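A minimal sketch of that mixed setup (GPU UMAP, CPU HDBSCAN), reusing the parameters from the report above and assuming docs is your list of documents:

from cuml.manifold import UMAP   # GPU-accelerated UMAP from RAPIDS cuML
from hdbscan import HDBSCAN      # CPU HDBSCAN
from bertopic import BERTopic

umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=2, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

With this split, the expensive embedding and UMAP steps still run on the GPU, while clustering and approximate_predict stay in host memory and avoid competing with the DL model for GPU memory.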
