This repository contains the code and supplementary material for *Learning Semantic Representations for Novel Words: Leveraging Both Form and Context* and *Attentive Mimicking: Better Word Embeddings by Attending to Informative Contexts*.
**Important:** The code found in this directory is a cleaned-up and easier-to-use version of the original form-context model. Due to random parameter initialization, results may deviate slightly from the ones reported in the papers mentioned above. If you want to use the form-context model, this is the right version for you. If, instead, you want to reproduce the original results, use the `naacl` branch for the NAACL paper results, or contact me via timo.schick<at>sulzer.de for the AAAI results.
To train your own instance of the form-context model (FCM) or Attentive Mimicking (AM), you need:
- a large text corpus (e.g., the Westbury Wikipedia corpus used in the papers cited above)
- a set of pretrained word embeddings (e.g., GloVe or word2vec)
Before training the model, you need to preprocess the text corpus. This can be done using the `fcm/preprocess.py` script:

```
python3 fcm/preprocess.py train --input PATH_TO_YOUR_TEXT_CORPUS --output TRAINING_DIRECTORY
```
If you leave all other parameters unchanged, this creates the following files in the specified output directory:

- `train.shuffled`: a shuffled version of your input corpus;
- `train.shuffled.tokenized`: a shuffled, tokenized and lowercased version of your input corpus;
- `train.vocX`, `train.vwcX`: vocabulary files containing all words that occur at least X times. In the `voc` format, each line contains exactly one word; in the `vwc` format, each line is of the form `<word> <count>`;
- `train.bucketX`: a bucket (or chunk) of training instances. In total, there will be 25 such buckets, and within each bucket, each line is of the form `<word><TAB><context1><TAB><context2><TAB>...` (see the parsing sketch below).
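For illustration, here is a minimal sketch of how such a bucket file could be read in Python. The file name `train.bucket0` and the loop body are hypothetical; only the tab-separated line format is taken from the description above:

```python
# Minimal sketch: iterate over one bucket file produced by fcm/preprocess.py.
# Each line is "<word>\t<context1>\t<context2>\t..." as described above.
def read_bucket(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *contexts = line.rstrip("\n").split("\t")
            yield word, contexts

# Hypothetical usage: count the contexts collected for each word.
for word, contexts in read_bucket("TRAINING_DIRECTORY/train.bucket0"):
    print(word, len(contexts))
```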
To get an overview of additional parameters for the preprocessing script, run `python3 fcm/preprocess.py -h`.
To train a new model, use the `fcm/train.py` script:

```
python3 fcm/train.py -m MODEL_PATH \
    --train_dir TRAINING_DIRECTORY \
    --emb_file PATH_TO_YOUR_WORD_EMBEDDINGS \
    --emb_dim DIMENSIONALITY_OF_YOUR_WORD_EMBEDDINGS \
    --vocab PATH_TO_THE_TRAIN.VWC100_FILE
```
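Since `--emb_dim` must match the dimensionality of the vectors in `--emb_file`, a quick sanity check can save a failed run. The sketch below assumes a plain-text embedding format in which each line is a word followed by whitespace-separated float components (as in GloVe text files); if your embedding file uses a different format, adapt it accordingly:

```python
# Minimal sanity check (assumption: plain-text embeddings where each line is
# "<word> <v_1> <v_2> ... <v_d>", as in GloVe text files).
def embedding_dim(path):
    with open(path, encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")
    return len(first_line.split()) - 1  # subtract 1 for the word itself

print(embedding_dim("PATH_TO_YOUR_WORD_EMBEDDINGS"))  # pass this as --emb_dim
```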
By default, the training script uses Attentive Mimicking. If you instead want to train the original FCM, pass `--sent_weights default`. Again, an overview of additional parameters for the training script can be obtained via `python3 fcm/train.py -h`.
Inferring embeddings for novel words requires a file where each line is of the form `<novel_word><TAB><context1><TAB><context2><TAB>...`. If you do not have such a file, you can generate it using the preprocessing script and a `.voc` file containing all the words you want embeddings for:
```
python3 fcm/preprocess.py test --input PATH_TO_YOUR_TEXT_CORPUS --output PATH_TO_THE_TEST_FILE --words PATH_TO_A_VOC_FILE
```
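If you do not yet have such a `.voc` file, it can be written with a few lines of Python. The words below are hypothetical examples; only the one-word-per-line format is defined above:

```python
# Minimal sketch: write a .voc file (one word per line, as described above).
# The novel words themselves are hypothetical examples.
novel_words = ["exoplanet", "memristor", "microbiome"]
with open("novel_words.voc", "w", encoding="utf-8") as f:
    for word in novel_words:
        f.write(word + "\n")
```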
The actual inference can then be done using the `fcm/infer_vectors.py` script:

```
python3 fcm/infer_vectors.py -m MODEL_PATH -i PATH_TO_THE_TEST_FILE -o PATH_TO_THE_OUTPUT_FILE
```
The specified output file will then contain lines of the form `<word> <embedding>`.
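To use the inferred vectors downstream, the output file can be loaded, for example, with NumPy. This is a sketch under the assumption that `<embedding>` consists of whitespace-separated float components:

```python
import numpy as np

# Minimal sketch: load inferred vectors from the output file.
# Assumption: each line is "<word> <v_1> <v_2> ... <v_d>" with
# whitespace-separated float components.
def load_vectors(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip("\n").split()
            vectors[word] = np.array([float(v) for v in values])
    return vectors

vectors = load_vectors("PATH_TO_THE_OUTPUT_FILE")
```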
This directory contains the CRW development dataset. For more info, refer to the AAAI paper.
This directory contains the VecMap dataset. For more info, refer to the NAACL paper.
If you make use of the VecMap dataset or Attentive Mimicking, please cite the following paper:
```
@inproceedings{schick2019attentive,
    title = {Attentive mimicking: Better word embeddings by attending to informative contexts},
    author = {Schick, Timo and Sch{\"u}tze, Hinrich},
    url = {https://arxiv.org/abs/1904.01617},
    booktitle = {Proceedings of the Seventeenth Annual Conference of the North American Chapter of the Association for Computational Linguistics},
    year = {2019}
}
```
If you make use of the CRW development set or the original form-context model, please cite the following paper:
```
@inproceedings{schick2019learning,
    title = {Learning Semantic Representations for Novel Words: Leveraging Both Form and Context},
    author = {Schick, Timo and Sch{\"u}tze, Hinrich},
    url = {https://arxiv.org/abs/1811.03866},
    booktitle = {Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence},
    year = {2019}
}
```