FastText SkipGram Implementation Broken since 3.7.2 #2508

Closed
zstachniak opened this issue May 29, 2019 · 7 comments

@zstachniak

zstachniak commented May 29, 2019

The FastText implementation using skip-gram appears to be broken since 3.7.2. Below is the sample code I am using, which is almost identical to the example in the docs but with additional printed output. In v3.7.1, everything runs fine, but in subsequent versions, an IndexError occurs during train_sg_pair.

# Sample Code
import sys
import gensim
from gensim.models import FastText
from gensim.test.utils import common_texts
print(f"Python {sys.version.split()[0]} | Gensim {gensim.__version__}")

sim_word = "computer"

print("CBOW")
cbow = FastText(size=4, window=3, min_count=1)
cbow.build_vocab(sentences=common_texts)
cbow.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)
cbow_similarities = " | ".join(
    [f"{word}: {sim:0.4f}" for (word, sim) in cbow.most_similar(sim_word)]
)
print(f"{sim_word}:: {cbow_similarities}")

print("Skip-Gram")
sg = FastText(size=4, window=3, min_count=1,
              sg=1)      # only difference!
sg.build_vocab(sentences=common_texts)
sg.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)
sg_similarities = " | ".join(
    [f"{word}: {sim:0.4f}" for (word, sim) in sg.most_similar(sim_word)]
)
print(f"{sim_word}:: {sg_similarities}")

Works in 3.7.1

Python 3.7.3 | Gensim 3.7.1
CBOW
<input>:16: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
computer:: graph: 0.6275 | time: 0.4795 | interface: 0.3012 | user: 0.1459 | trees: 0.0747 | system: -0.1502 | human: -0.2375 | survey: -0.3557 | response: -0.5107 | eps: -0.5126
Skip-Gram
<input>:27: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
computer:: graph: 0.6274 | time: 0.4800 | interface: 0.3014 | user: 0.1466 | trees: 0.0741 | system: -0.1501 | human: -0.2371 | survey: -0.3564 | response: -0.5109 | eps: -0.5127

Fails in 3.7.3

Python 3.7.3 | Gensim 3.7.3
CBOW
C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training.
  "C extension not loaded, training will be slow. "
<input>:16: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
computer:: human: 0.8081 | interface: 0.5414 | graph: 0.4632 | time: 0.3914 | survey: 0.0709 | eps: -0.1581 | minors: -0.1638 | trees: -0.2344 | user: -0.4144 | system: -0.4159
Skip-Gram
Exception in thread Thread-47:
Traceback (most recent call last):
  File "C:\Users\yzxs008\AppData\Local\Programs\Python\Python37\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Users\yzxs008\AppData\Local\Programs\Python\Python37\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\base_any2vec.py", line 211, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\fasttext.py", line 834, in _do_train_job
    tally += train_batch_sg(self, sentences, alpha, work, neu1)
  File "C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\fasttext.py", line 412, in train_batch_sg
    train_sg_pair(model, model.wv.index2word[word2.index], subwords_indices, alpha, is_ft=True)
  File "C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\word2vec.py", line 418, in train_sg_pair
    l1_ngrams = np_sum(context_vectors_ngrams[context_index[1:]], axis=0)
IndexError: too many indices for array
@gojomo
Collaborator

gojomo commented May 29, 2019

Given the extra error ("C extension not loaded, training will be slow"), it looks like (1) your gensim-3.7.3 installation didn't get the native libraries your earlier installation did; and (2) the gensim plain-Python code path is what's broken. That's rarely used, as it's up-to-100x slower, and thus far must be manually tested (since the normal, important testing successfully loads/tests the optimized variants).

So, @zstachniak, your local problem may be fixable by ensuring the native libraries are available. On Windows, often a 'wheel' install or 'conda' install will succeed in that, even when a 'pip install' does not. (You have to watch the install output closely; a failure to build native libraries will generate a message, but not cause the overall installation to fail.)
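
A quick way to check which code path an installation is using is sketched below. It assumes that `gensim.models.word2vec` and `gensim.models.fasttext` expose a module-level `FAST_VERSION` that is negative when the compiled extension failed to import, as in gensim 3.x; the helper names `fast_version` and `describe` are made up for this illustration.

```python
import importlib


def fast_version(module_name="gensim.models.word2vec"):
    """Return the module's FAST_VERSION, or None if it cannot be determined."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # gensim (or this submodule) is not importable
    # Assumption: the module sets FAST_VERSION at import time.
    return getattr(module, "FAST_VERSION", None)


def describe(version):
    """Turn a FAST_VERSION value into a human-readable status."""
    if version is None:
        return "unknown: module missing or FAST_VERSION not defined"
    if version < 0:
        return "pure-Python fallback: compiled extension not loaded"
    return "compiled extension loaded (FAST_VERSION=%d)" % version


if __name__ == "__main__":
    for name in ("gensim.models.word2vec", "gensim.models.fasttext"):
        print(name, "->", describe(fast_version(name)))
```

If either line reports the pure-Python fallback, the installation most likely skipped building the native libraries.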

The gensim-side problem would require either (1) fixing up and testing the pure-Python paths (and perhaps arranging for them to be auto-tested, though that would be a pain and would noticeably slow automated testing); or (2) explicitly dropping support for the plain-Python paths and improving the error messages shown when the optimized code isn't available.
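
Until something like option (2) exists in gensim itself, a user-side approximation is to promote the "C extension not loaded" warning to an exception, so that a missing native build stops the script instead of silently falling back to the slow path. This is only a sketch: it assumes the message is emitted through Python's `warnings` module as a `UserWarning`, as the output above suggests, and the function name `fail_on_slow_path` is hypothetical.

```python
import warnings


def fail_on_slow_path():
    """Raise instead of warn when the pure-Python fallback is announced."""
    # `message` is a regex matched against the start of the warning text.
    warnings.filterwarnings(
        "error",
        message="C extension not loaded",
        category=UserWarning,
    )
```

Calling `fail_on_slow_path()` once, before building or training a model, should be enough; other warnings are left untouched.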

@piskvorky
Owner

piskvorky commented May 29, 2019

I'm inclined toward (2). We're really trying to tighten up our interfaces and remove brittle / academic fluff now.

The pure-Python path may have been useful for educational reasons historically, but it serves little purpose now, and it has real downsides: it is untested and it masks installation issues.

CC @mpenkov thoughts?

@zstachniak
Author

Ah, interesting. @gojomo, any idea why a pip install of 3.7.1 works with my C compiler but 3.7.2 and above do not? I'm not seeing any messages indicating an error during install...

@zstachniak
Author

Update: When trying to install directly from a PyPI download, I did finally encounter error messages during install (but still only for 3.7.3). For some reason, running pip install gensim==3.7.3 did not warn me about any problems.

Devs, let me know if I should close this issue, and thanks for your support!

For any other Python users who are forced to use a Windows box...
After spending far too long monkeying around with Visual Studio C++ compiler support, I ended up installing gensim from a Windows binary. My error message has gone away and everything is running correctly now.

@mpenkov
Collaborator

mpenkov commented May 30, 2019

I'm +1 for removing pure-Python support for fasttext. I can't see a reason for using it. @menshikh-iv WDYT?

@menshikh-iv
Contributor

@mpenkov I'm +1 for dropping the pure-Python implementations of w2v/d2v/ft/etc. and keeping only the Cython implementations.

@mpenkov
Collaborator

mpenkov commented May 30, 2019

OK, opened a separate ticket to deal with it. I think we can close this one.
