Modèles de Langage (Language Models)

This repo contains material from a project on Word2Vec for the course "Modèles de Langage" at Aix-Marseille Université.

The project comprises two parts: the implementation and improvement of a base Word2Vec model using negative sampling and subsampling, and the evaluation of word analogies using existing embeddings from NLPL.

The findings are documented in the report, which can be found in the report directory.

Setup

Install dependencies

$ python -m venv venv
$ . venv/bin/activate
$ pip install -r requirements.txt

Word2Vec Model

The Word2Vec experiment for part one of the report can be configured and executed via main.py. The corpus and evaluation set can be found in the data directory.
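As a rough illustration of the two techniques named above, the following sketch shows frequent-word subsampling and the smoothed unigram distribution for negative sampling as described by Mikolov et al. (2013b). It is not the code in w2v.py: the function names, the threshold t=1e-5 and the exponent 0.75 are assumptions taken from the paper's defaults, and the project may use different values.

```python
import math
import random
from collections import Counter


def keep_probability(count, total, t=1e-5):
    """Probability of keeping one occurrence of a word with the given count."""
    freq = count / total
    # The paper discards with probability 1 - sqrt(t / f(w));
    # clamping at 1.0 means rare words are always kept.
    return min(1.0, math.sqrt(t / freq))


def subsample(tokens, t=1e-5, rng=random):
    """Randomly drop occurrences of frequent words from a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if rng.random() < keep_probability(counts[w], total, t)]


def negative_sampling_weights(counts, power=0.75):
    """Unigram counts raised to `power`, normalised to sum to 1."""
    scaled = {w: c ** power for w, c in counts.items()}
    norm = sum(scaled.values())
    return {w: s / norm for w, s in scaled.items()}


def draw_negatives(weights, k, exclude, rng=random):
    """Draw k negative words, never returning the excluded (positive) word."""
    words = [w for w in weights if w != exclude]
    probs = [weights[w] for w in words]
    return rng.choices(words, weights=probs, k=k)
```

Raising the counts to a power below 1 flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would suggest.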

Analogies

Download the pre-generated embeddings:

$ wget http://vectors.nlpl.eu/repository/20/43.zip
$ unzip 43.zip -d 43

Feel free to read the metadata and README contained in the downloaded folder, then move the embeddings to the correct directory:

$ mv 43/model.txt data
$ rm -rf 43 43.zip

The analogies experiment for the second part of the report is contained in analogies.py.
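For readers unfamiliar with the task, a word analogy "a is to b as c is to d" is commonly solved by finding the word whose vector is closest to b - a + c. The sketch below is a minimal brute-force version using cosine similarity. It is not the code in analogies.py: the function names and the toy three-dimensional vectors are invented for illustration, and the repository's kdtree.py suggests the project itself uses a k-d tree for the nearest-neighbour search instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded, as is usual for this task.
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Made-up toy vectors, for illustration only.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 0.2],
}
print(solve_analogy(toy, "man", "woman", "king"))  # prints "queen"
```

A brute-force scan compares the target against every word in the vocabulary, which is simple but slow for large embedding files.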

Currently, the read_file function must be called before running the experiment so that the embeddings are split and saved in chunks.
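The idea of splitting a large embedding file into chunks can be sketched as follows. This is not the repository's read_file: the function name split_embeddings, the chunk size, the output file naming, and the assumption that the file is plain text with one vector per line and an optional "vocab_size dimension" header are all hypothetical.

```python
import os


def split_embeddings(path, out_dir, chunk_size=50000):
    """Split a text embedding file into files of at most chunk_size vectors."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    chunk, index = [], 0

    def flush():
        # Write the current chunk to disk (if any) and start a new one.
        nonlocal chunk, index
        if chunk:
            out_path = os.path.join(out_dir, f"chunk_{index:04d}.txt")
            with open(out_path, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            written.append(out_path)
            chunk, index = [], index + 1

    with open(path, encoding="utf-8") as src:
        first = src.readline()
        # Assumed header of the form "<vocab_size> <dimension>" is skipped;
        # any other non-empty first line is treated as a vector.
        if first.strip() and len(first.split()) != 2:
            chunk.append(first)
        for line in src:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                flush()
    flush()
    return written
```

Under these assumptions, a call such as split_embeddings("data/model.txt", "embeddings") would write numbered chunk files that can be loaded one at a time instead of holding the whole model in memory.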

References

Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Yoshua Bengio and Yann LeCun, editors, 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013a. URL http://arxiv.org/abs/1301.3781.

Tomás Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013b. URL https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.

Murhaf Fares, Andrey Kutuzov, Stephan Oepen, and Erik Velldal. Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Jörg Tiedemann and Nina Tahmasebi, editors, Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 271–276, Gothenburg, Sweden, May 2017. Association for Computational Linguistics. URL https://aclanthology.org/W17-0237.

Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209–226, 1977.

Daniel Jurafsky. Speech and language processing, 2000.
