Tensor sizes not matching #3

Open
ericchagnon15 opened this issue Jan 20, 2023 · 1 comment

Comments

ericchagnon15 commented Jan 20, 2023

I'm trying to use this model with BERTopic in Google Colab for topic modeling, but I'm unable to run the model. The data is a subset of the arXiv dataset, with each document being the title and abstract concatenated.

from transformers import pipeline
from bertopic import BERTopic

# Use the Aspire sentence embedder as the embedding backend for BERTopic.
ASPIRE = pipeline("feature-extraction", model="allenai/aspire-sentence-embedder")

# Fit on a 200-document subset of the concatenated title + abstract strings.
less_docs = arxiv_docs[:200]
topic_model = BERTopic(embedding_model=ASPIRE, language="english", nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(less_docs)

When the fit_transform() method is called, the following error occurs:
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>
      5
      6 topic_model = BERTopic(embedding_model=ASPIRE, language="english", nr_topics="auto", verbose=True )
----> 7 topics, probs = topic_model.fit_transform(less_docs)

12 frames
/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    235         if self.position_embedding_type == "absolute":
    236             position_embeddings = self.position_embeddings(position_ids)
--> 237             embeddings += position_embeddings
    238         embeddings = self.LayerNorm(embeddings)
    239         embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (541) must match the size of tensor b (512) at non-singleton dimension 1

@MSheshera
Contributor

I have not used the model in this manner before, so I can't say definitively what is wrong. From the looks of it, though, the tokenizer may not be truncating the input documents to 512 tokens. If BERTopic has an option to truncate the input documents, you can try that. Otherwise, you can manually truncate the individual documents in arxiv_docs to roughly 450 (whitespace-tokenized) tokens; a sketch of that is below.
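For reference, a minimal sketch of the manual truncation suggested above, assuming arxiv_docs is a plain list of strings (the helper name truncate_doc is just for illustration, not part of any library):

def truncate_doc(doc, max_tokens=450):
    # Keep roughly the first max_tokens whitespace-separated tokens so that,
    # after subword tokenization, the input stays under BERT's 512-token limit.
    return " ".join(doc.split()[:max_tokens])

less_docs = [truncate_doc(d) for d in arxiv_docs[:200]]
topics, probs = topic_model.fit_transform(less_docs)

Whitespace tokens usually expand into more than one subword token, which is why the suggested cutoff is ~450 rather than 512.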
