Result too large #1168

Open
vantubbe opened this issue Jan 26, 2023 · 7 comments
Labels: Discussion, Performance

@vantubbe

I'm running a model using DBStream, and after training on about 100k datapoints I get a (34, Result too large) error. I'm sure this is something I'm doing wrong and not an issue with the library. I would appreciate any suggestions on how to handle this.

@MaxHalford
Member

Hey. How about sharing some code first? You have to make a bit of an effort if you want us to help you; we're not magicians 🧙‍♂️

@vantubbe
Author

vantubbe commented Jan 26, 2023

@MaxHalford I apologize; I really appreciate both the product and any help you're willing to give. I've added an explanation and some code below.

I'm using DBStream for online topic modeling with BERTopic as described here. I run partial_fit (which calls learn_one) on batches of 10k sentences. It works great, but eventually the line self.model = self.model.learn_one(umap_embedding) raises (34, Result too large). I noticed that umap_embedding is of dtype=float32; I'm not sure if that's related to the overflow issue.

from river import cluster
from river import stream


class River:
    # Wrapper so the river clusterer exposes the partial_fit / labels_ / predict
    # interface that BERTopic expects from a clustering model.
    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        # Update the clusterer one embedding at a time.
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        # Assign a cluster label to every embedding in the batch.
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)

        self.labels_ = labels
        return self

    def predict(self, umap_embeddings):
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)
        return labels


river_cluster_model = River(cluster.DBSTREAM(
    clustering_threshold=1,
    fading_factor=0.01,
    cleanup_interval=2,
    intersection_factor=0.3,
    minimum_weight=10.0,
))
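
For reference, here is a minimal way to exercise the wrapper outside of BERTopic; the random float32 array is just a stand-in for my UMAP embeddings, not the real pipeline:

import numpy as np

# Fake batch of 10k "UMAP embeddings" in float32, like in my setup.
fake_embeddings = np.random.rand(10_000, 5).astype(np.float32)

river_cluster_model.partial_fit(fake_embeddings)
print(river_cluster_model.labels_[:10])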

@MaxHalford
Member

Thanks a lot @vantubbe, I'll look into it. FYI @hoanganhngo610.

@vantubbe
Author

vantubbe commented Jan 26, 2023

Small update: I was able to bypass the issue. In dbstream.py, the only places I could see a possible overflow are when calculating the micro-cluster weight and density. For example,

self.micro_clusters[i].weight = (
    self.micro_clusters[i].weight
    * 2 ** (-self.fading_factor * (self.time_stamp - self.micro_clusters[i].last_update))
    + 1
)

or

self.s[i][j] = (
    self.s[i][j]
    * 2 ** (-self.fading_factor * (self.time_stamp - self.s_t[i][j]))
    + 1
)
self.s_t[i][j] = self.time_stamp

If self.time_stamp - self.micro_clusters[i].last_update or self.time_stamp - self.s_t[i][j] gets too large, then an overflow can occur. I "bypassed" it by applying a simple min:

-self.fading_factor * (min(self.time_stamp - self.micro_clusters[i].last_update, 100))
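
Pulled out as a standalone helper, the workaround looks roughly like this (the function name and the max_delta parameter are just mine for illustration, this isn't river code):

def faded_value(value, fading_factor, time_delta, max_delta=100):
    # Clamp the elapsed time so the exponent in 2 ** (-fading_factor * dt) stays bounded.
    # Side effect: with fading_factor=0.01 and max_delta=100, the decay multiplier
    # never drops below 2 ** -1 = 0.5.
    dt = min(time_delta, max_delta)
    return value * 2 ** (-fading_factor * dt) + 1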

I doubt this is a valid issue. It's likely that my setup, config, or saving & loading the model is causing these weights to grow too large/small, but I'm not sure what situation would cause this.

@hoanganhngo610
Contributor

Sorry for the late response @vantubbe @MaxHalford. I will have a look within this week, and hopefully come back with a response ASAP.

@lshihui

lshihui commented Aug 15, 2023

Hi @hoanganhngo610, I'm facing the same issue while using DBStream for online topic modeling with BERTopic. May I check if there's any resolution? Thank you!

@MaxHalford added the Performance and Discussion labels on Oct 30, 2023
@hoanganhngo610
Contributor

Hi @vantubbe and @lshihui. I am really sorry for letting this issue slip through for such a long time. If the problem still persists for you, would you mind sharing the actual use case that caused this error? Because, if I understand correctly, even if

self.time_stamp - self.micro_clusters[i].last_update

or

self.time_stamp - self.s_t[i][j]

gets large, the negative sign combined with the positive fading factor keeps the exponent negative, so this should not cause the problem at all.
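
As a quick sanity check of the Python behaviour involved (just a sketch using the thread's fading_factor=0.01, not river code): 2 ** x with a float exponent only raises the errno-34 OverflowError when x is a large positive number, whereas a large negative exponent simply underflows to 0.0.

fading_factor = 0.01

# Large *negative* exponent (the expected case): the fading term underflows to 0.0.
print(2 ** (-fading_factor * 1_000_000))  # 0.0

# Large *positive* exponent: this is what raises OverflowError with errno 34
# (the exact message, e.g. "Result too large", depends on the platform).
try:
    2 ** (fading_factor * 1_000_000)
except OverflowError as err:
    print(err)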
