[Feature request] Load full native fastText model to continue training on new data #2160
@tranhungnghiep thanks for the request. As I remember, FB distributes two types of models.

I think this is a bug in the current implementation (this should already work):

from gensim.models import FastText
from gensim.test.utils import common_texts

m = FastText.load_fasttext_format("wiki.ru.bin")  # load wiki FB model from https://fasttext.cc/docs/en/pretrained-vectors.html
m.build_vocab(common_texts, update=True)  # this doesn't work, but should. See also https://github.com/RaRe-Technologies/gensim/issues/2139
"""
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in build_vocab(self, sentences, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
480 return super(FastText, self).build_vocab(
481 sentences, update=update, progress_per=progress_per,
--> 482 keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
483
484 def _set_train_params(self, **kwargs):
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.pyc in build_vocab(self, sentences, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
805 trim_rule=trim_rule, **kwargs)
806 report_values['memory'] = self.estimate_memory(vocab_size=report_values['num_retained_words'])
--> 807 self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
808
809 def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in prepare_weights(self, hs, negative, wv, update, vocabulary)
932
933 def prepare_weights(self, hs, negative, wv, update=False, vocabulary=None):
--> 934 super(FastTextTrainables, self).prepare_weights(hs, negative, wv, update=update, vocabulary=vocabulary)
935 self.init_ngrams_weights(wv, update=update, vocabulary=vocabulary)
936
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in prepare_weights(self, hs, negative, wv, update, vocabulary)
1744 self.reset_weights(hs, negative, wv)
1745 else:
-> 1746 self.update_weights(hs, negative, wv)
1747
1748 def seeded_vector(self, seed_string, vector_size):
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in update_weights(self, hs, negative, wv)
1791 self.syn1 = vstack([self.syn1, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
1792 if negative:
-> 1793 self.syn1neg = vstack([self.syn1neg, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
1794 wv.vectors_norm = None
1795
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
"""
m.train(common_texts, epochs=1, total_examples=len(common_texts))
"""
Exception in thread Thread-17:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 164, in _worker_loop
tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.py", line 555, in _do_train_job
tally += train_batch_sg(self, sentences, alpha, work, neu1)
File "gensim/models/fasttext_inner.pyx", line 276, in gensim.models.fasttext_inner.train_batch_sg
cdef REAL_t *word_locks_vocab = <REAL_t *>(np.PyArray_DATA(model.trainables.vectors_vocab_lockf))
AttributeError: 'FastTextTrainables' object has no attribute 'vectors_vocab_lockf'
""" Of course, I'm +1 for fix this issue -> training will work as @tranhungnghiep suggest. Related issue - #2139 |
@menshikh-iv Thanks for looking into it. This issue is a more low-level problem, particularly …
Hi @menshikh-iv, it seems that the hidden vectors are still bad. I'm using the …
Hi @aviclu, please post more information.
@aviclu Please open a new ticket and be sure to fill in the template.
Currently, gensim cannot load a native fastText model and continue training it on new data. According to the docs [1], this is because it only loads the input-hidden matrix. However, fastText also saves the hidden-output matrix [2].
Moreover, even the input-hidden matrix alone could support some sort of transfer learning, with the hidden-output matrix initialized randomly, similar to how
gensim.models.Word2Vec.intersect_word2vec_format()
works. Please correct me if I'm wrong here, but I think there is no technical issue preventing gensim from loading a fastText model and continuing to train it. How about supporting this feature?
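For illustration, a sketch of what the requested workflow could look like. It relies on gensim.models.fasttext.load_facebook_model, which newer gensim releases (3.8 and later) provide for loading the full native model, including the hidden-output matrix; the file name and the toy corpus below are placeholders:

from gensim.models.fasttext import load_facebook_model

# "cc.en.300.bin" stands in for any full native fastText model (.bin, not .vec)
# downloaded from https://fasttext.cc/
model = load_facebook_model("cc.en.300.bin")

new_sentences = [
    ["machine", "learning", "is", "fun"],
    ["fasttext", "handles", "subword", "information"],
]

# Extend the existing vocabulary with the new data, then continue training.
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)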