Result too large #1168

Open
vantubbe opened this issue Jan 26, 2023 · 7 comments
Labels: Discussion, Performance

@vantubbe

I'm running a model using DBStream, and after training on about 100k datapoints I get a (34, Result too large) error. I'm sure this is something I'm doing wrong and not an issue with the library. I would appreciate any suggestions on how to handle this.

@MaxHalford
Member

Hey. How about sharing some code first? You have to make a bit of an effort if you want us to help you; we're not magicians 🧙‍♂️

@vantubbe
Author

vantubbe commented Jan 26, 2023

@MaxHalford I apologize; I really appreciate both the product and any help you're willing to give. I've added an explanation and some code below.

I'm using DBStream for online topic modeling with BERTopic as described here. I run partial_fit (which calls learn_one) on batches of 10k sentences. It works great, but eventually the line self.model = self.model.learn_one(umap_embedding) raises (34, Result too large). I noticed that umap_embedding is of dtype=float32; I'm not sure if that's related to the overflow issue.

from river import cluster
from river import stream


class River:
    # Wrapper so the river clusterer exposes the partial_fit / labels_ / predict
    # interface that BERTopic expects from a clustering model.
    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        # Update the clusterer one embedding at a time.
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        # Assign a cluster label to every embedding in the batch.
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)

        self.labels_ = labels
        return self

    def predict(self, umap_embeddings):
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)
        return labels


river_cluster_model = River(cluster.DBSTREAM(
    clustering_threshold=1,
    fading_factor=0.01,
    cleanup_interval=2,
    intersection_factor=0.3,
    minimum_weight=10.0,
))
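
For reference, here is a minimal way to exercise the wrapper outside of BERTopic; the random float32 array is just a stand-in for my UMAP embeddings, not the real pipeline:

import numpy as np

# Fake batch of 10k "UMAP embeddings" in float32, like in my setup.
fake_embeddings = np.random.rand(10_000, 5).astype(np.float32)

river_cluster_model.partial_fit(fake_embeddings)
print(river_cluster_model.labels_[:10])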

@MaxHalford
Member

Thanks a lot @vantubbe, I'll look into it. FYI @hoanganhngo610.

@vantubbe
Author

vantubbe commented Jan 26, 2023

Small update: I was able to bypass the issue. In dbstream.py, the only places I could see a possible overflow are when calculating the micro-cluster weight and density. For example,

self.micro_clusters[i].weight = (
    self.micro_clusters[i].weight
    * 2 ** (-self.fading_factor * (self.time_stamp - self.micro_clusters[i].last_update))
    + 1
)

or

self.s[i][j] = (
    self.s[i][j]
    * 2 ** (-self.fading_factor * (self.time_stamp - self.s_t[i][j]))
    + 1
)
self.s_t[i][j] = self.time_stamp

If self.time_stamp - self.micro_clusters[i].last_update or self.time_stamp - self.s_t[i][j] gets too large, then an overflow can occur. I "bypassed" it by applying a simple min:

-self.fading_factor * (min(self.time_stamp - self.micro_clusters[i].last_update, 100))
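
Pulled out as a standalone helper, the workaround looks roughly like this (the function name and the max_delta parameter are just mine for illustration, this isn't river code):

def faded_value(value, fading_factor, time_delta, max_delta=100):
    # Clamp the elapsed time so the exponent in 2 ** (-fading_factor * dt) stays bounded.
    # Side effect: with fading_factor=0.01 and max_delta=100, the decay multiplier
    # never drops below 2 ** -1 = 0.5.
    dt = min(time_delta, max_delta)
    return value * 2 ** (-fading_factor * dt) + 1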

I doubt this is a valid issue. It's likely that my setup, config, or saving & loading the model is causing these weights to grow too large/small, but I'm not sure what situation would cause this.

@hoanganhngo610
Contributor

Sorry for the late response @vantubbe @MaxHalford. I will have a look within this week, and hopefully come back with a response ASAP.

@lshihui

lshihui commented Aug 15, 2023

Hi @hoanganhngo610, I'm facing the same issue while using DBStream for online topic modeling with BERTopic. May I check if there's any resolution? Thank you!

@MaxHalford added the Performance and Discussion labels on Oct 30, 2023
@hoanganhngo610
Contributor

Hi @vantubbe and @lshihui. I am really sorry for letting this issue slip through for such a long time. If the problem still persists for you, would you mind sharing the actual use case that caused this error? Because, if I understand correctly, even if

self.time_stamp - self.micro_clusters[i].last_update

or

self.time_stamp - self.s_t[i][j]

gets large, the negative sign combined with the positive fading factor keeps the exponent negative, so this should not cause the problem at all.
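
As a quick sanity check of the Python behaviour involved (just a sketch using the thread's fading_factor=0.01, not river code): 2 ** x with a float exponent only raises the errno-34 OverflowError when x is a large positive number, whereas a large negative exponent simply underflows to 0.0.

fading_factor = 0.01

# Large *negative* exponent (the expected case): the fading term underflows to 0.0.
print(2 ** (-fading_factor * 1_000_000))  # 0.0

# Large *positive* exponent: this is what raises OverflowError with errno 34
# (the exact message, e.g. "Result too large", depends on the platform).
try:
    2 ** (fading_factor * 1_000_000)
except OverflowError as err:
    print(err)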
