3.4.0
3.4.0, 2018-03-01
New features:
- Massive optimizations of `gensim.models.LdaModel`: much faster training, using Cython. (@arlenk, #1767)
  - Training benchmark

    | dataset | old LDA [sec] | optimized LDA [sec] | speed up |
    |---------|---------------|---------------------|----------|
    | nytimes | 3473 | 1975 | 1.76x |
    | enron | 774 | 437 | 1.77x |

  - This change affects all models that depend on `LdaModel`, such as `LdaMulticore`, `LdaSeqModel`, `AuthorTopicModel`.
- Huge speed-ups to corpus I/O with `MmCorpus` (Cython) (@arlenk, #1825)
  - File reading benchmark

    | dataset | file compressed? | old MmReader [sec] | optimized MmReader [sec] | speed up |
    |---------|------------------|--------------------|--------------------------|----------|
    | enron | no | 22.3 | 2.6 | 8.7x |
    | enron | yes | 37.3 | 14.4 | 2.6x |
    | nytimes | no | 419.3 | 49.2 | 8.5x |
    | nytimes | yes | 686.2 | 275.1 | 2.5x |
    | text8 | no | 25.4 | 2.5 | 10.1x |
    | text8 | yes | 41.9 | 17.0 | 2.5x |

  - Overall, a 2.5x speedup for compressed `.mm.gz` input and 8.5x for uncompressed plaintext `.mm`.
- Performance and memory optimization to `gensim.models.FastText` (@jbaiter, #1916)
  - Benchmark (first 500,000 articles from English Wikipedia)

    | Metric | old FastText | optimized FastText | improvement |
    |--------|--------------|--------------------|-------------|
    | Training time (1 epoch) | 4823.4s (80.38 minutes) | 1873.6s (31.22 minutes) | 2.57x |
    | Training time (full) | 1h 26min 13s | 36min 43s | 2.35x |
    | Training words/sec | 72,781 | 187,366 | 2.57x |
    | Training peak memory | 5.2 GB | 3.7 GB | 1.4x |

  - Overall, a 2.5x speedup & memory usage reduced by 30%.
- Implemented Soft Cosine Measure (@Witiko, #1827)
  - New method for assessing document similarity, a nice, faster alternative to Word Mover's Distance (WMD)
  - Benchmark

    | Technique | MAP score | Duration |
    |-----------|-----------|----------|
    | softcossim | 45.99 | 1.24 sec |
    | wmd-relax | 44.48 | 12.22 sec |
    | cossim | 44.22 | 4.39 sec |
    | wmd-gensim | 44.08 | 98.29 sec |

  - Soft Cosine notebook with detailed description, examples & benchmarks
  - Related papers:
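
As a rough illustration of the Soft Cosine Measure entry above: the measure generalizes cosine similarity by weighting pairs of terms with a term-similarity matrix S, giving aᵀSb / (sqrt(aᵀSa) · sqrt(bᵀSb)). The following is a minimal pure-Python sketch of that standard definition, not the code added in #1827. The function name `soft_cosine`, the three-term vocabulary, and the similarity values are assumptions made up for this example.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine of bag-of-words vectors a and b under term-similarity matrix sim."""

    def form(x, y):
        # Bilinear form x^T S y over all term pairs (i, j).
        return sum(
            sim[i][j] * x[i] * y[j]
            for i in range(len(x))
            for j in range(len(y))
        )

    denom = math.sqrt(form(a, a)) * math.sqrt(form(b, b))
    return form(a, b) / denom if denom else 0.0


# Hypothetical 3-term vocabulary: "play", "game", "tax".
# The 0.5 similarity between "play" and "game" is an invented value.
sim = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_play = [1, 0, 0]  # a document containing only "play"
doc_game = [0, 1, 0]  # a document containing only "game"

print(soft_cosine(doc_play, doc_game, identity))  # 0.0, same as ordinary cosine
print(soft_cosine(doc_play, doc_game, sim))       # 0.5
```

With the identity matrix the two single-term documents are orthogonal, as in ordinary cosine similarity. With the invented 0.5 similarity between "play" and "game" they get a non-zero score, which is how the measure can match documents that share related but not identical words.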
Improvements:
- New method to show the Gensim installation parameters: `python -m gensim.scripts.package_info --info`. Use this when reporting problems, for easier debugging. Fix #1902 (@sharanry, #1903)
- Added a flag to optionally skip network-related tests, to help maintainers avoid network issues with CI services (@menshikh-iv, #1930)
- Added `license` field to `setup.py`, allowing the use of tools like `pip-licenses` (@nils-werner, #1909)
Bug fixes:
- Fix Python 3 compatibility for `gensim.corpora.UciCorpus.save_corpus` (@darindf, #1875)
- Add `wv` property to KeyedVectors for backward compatibility. Fix #1882 (@manneshiva, #1884)
- Fix deprecation warning from `inspect.getargspec`. Fix #1878 (@aneesh-joshi, #1887)
- Add `LabeledSentence` to `gensim.models.doc2vec` for backward compatibility. Fix #1886 (@manneshiva, #1891)
- Fix empty output bug in `Phrases` (when using `model[tokens]` twice). Fix #1401 (@sj29-innovate, #1853)
- Fix type problems for `D2VTransformer.fit_transform`. Fix #1834 (@Utkarsh-Mishra-CIC, #1845)
- Fix `datatype` parameter for `KeyedVectors.load_word2vec_format`. Fix #1682 (@pushpankar, #1819)
- Fix deprecated parameters in `doc2vec-lee` notebook (@TheFlash10, #1918)
- Fix file-like closing bug in `gensim.corpora.MmCorpus`. Fix #1869 (@sj29-innovate, #1911)
- Fix precision problem in `test_similarities.py`, no more FP fails. (@menshikh-iv, #1928)
- Fix encoding in Lee corpus reader. (@menshikh-iv, #1931)
- Fix OOV pairs counter in `WordEmbeddingsKeyedVectors.evaluate_word_pairs`. (@akutuzov, #1934)
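
For readers hitting the same `inspect.getargspec` deprecation warning in their own code: the function is deprecated in Python 3 in favor of `inspect.signature` and `inspect.getfullargspec`. The changelog does not say which replacement #1887 used, so the sketch below only shows one common approach. The helper `arg_names` and the function `train` are hypothetical names made up for this example.

```python
import inspect


def arg_names(func):
    """Return the parameter names of func without calling inspect.getargspec."""
    return list(inspect.signature(func).parameters)


def train(corpus, num_topics=10, *, passes=1):
    """Hypothetical function, used only to demonstrate introspection."""


print(arg_names(train))  # ['corpus', 'num_topics', 'passes']
```

Unlike `getargspec`, `inspect.signature` also handles keyword-only parameters such as `passes` above.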
Tutorial and doc improvements:
- Fix example block for `gensim.models.Word2Vec` (@nzw0301, #1870)
- Fix `doc2vec-lee` notebook (@numericlee, #1870)
- Store images from `README.md` directly in repository. Fix #1849 (@ibrahimsharaf, #1861)
- Add Windows venv activate command to `CONTRIBUTING.md` (@aneesh-joshi, #1880)
- Add anaconda-cloud badge. Partial fix #1901 (@sharanry, #1905)
- Fix docstrings for LSI-related code (@steremma, #1892)
- Fix description of the `sg` parameter for `gensim.models.word2vec` (@mdcclv, #1919)
- Refactor documentation for `gensim.similarities.docsim` and `MmCorpus`-related code. (@CLearERR & @menshikh-iv, #1910)
- Fix docstrings for `gensim.test.utils` (@yurkai & @menshikh-iv, #1904)
- Refactor docstrings for `gensim.scripts`. Partial fix #1665 (@yurkai & @menshikh-iv, #1792)
- Refactor API reference for `gensim.corpora`. Partial fix #1671 (@CLearERR & @menshikh-iv, #1835)
- Fix documentation for `gensim.models.wrappers` (@kakshay21 & @menshikh-iv, #1859)
- Fix docstrings for `gensim.interfaces` (@yurkai & @menshikh-iv, #1913)
Deprecations (will be removed in the next major release)
- Remove
  - `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
  - `gensim.examples`
  - `gensim.nosy`
  - `gensim.scripts.word2vec_standalone`
  - `gensim.scripts.make_wiki_lemma`
  - `gensim.scripts.make_wiki_online`
  - `gensim.scripts.make_wiki_online_lemma`
  - `gensim.scripts.make_wiki_online_nodebug`
  - `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation)
  - "deprecated" functions and attributes
- Move
  - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
  - `gensim.summarization` ➡ `gensim.models.summarization`
  - `gensim.topic_coherence` ➡ `gensim.models._coherence`
  - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
  - `gensim.parsing.*` ➡ `gensim.utils.text_utils`