Code for Improved Word Representation Learning with Sememes (ACL 2017).
Use the following commands to train word-sense-sememe embeddings.
cp SSA.c word2vec/word2vec.c    # or MST.c / SAC.c / SAT.c
cd word2vec
make
./word2vec -train TrainFile -output vectors.bin -cbow 0 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 30 -binary 1 -iter 1 -read-vocab VocabFile -read-meaning SememeFile -read-sense Word_Sense_Sememe_File -min-count 1 -alpha 0.025
`TrainFile` is the training dataset. The following three files can be found in the `datasets` directory: `VocabFile` is the word vocabulary file, `SememeFile` is the sememe vocabulary file, and `Word_Sense_Sememe_File` records the word-sense-sememe grouping information. Before training, you should replace `word2vec/word2vec.c` with one of the four files `SSA.c`/`MST.c`/`SAC.c`/`SAT.c`.
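With `-binary 1`, the training command above writes `vectors.bin`. Assuming the output follows the standard word2vec binary layout (a text header with the vocabulary size and dimensionality, then each word followed by its raw 32-bit floats), a minimal C sketch for loading it might look like the following. It is illustrative only and not part of the released code; the default path is a placeholder.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_WORD 100

/* Minimal reader for the word2vec binary format:
   header "vocab_size dim\n", then per entry: the word, one space, dim raw floats. */
int main(int argc, char **argv) {
  const char *path = argc > 1 ? argv[1] : "vectors.bin";  /* placeholder default */
  FILE *f = fopen(path, "rb");
  if (f == NULL) { fprintf(stderr, "cannot open %s\n", path); return 1; }

  long long words, size;
  if (fscanf(f, "%lld %lld", &words, &size) != 2) { fclose(f); return 1; }

  char *vocab = (char *)malloc((size_t)words * MAX_WORD);
  float *M = (float *)malloc((size_t)words * (size_t)size * sizeof(float));
  if (vocab == NULL || M == NULL) { fprintf(stderr, "out of memory\n"); return 1; }

  for (long long b = 0; b < words; b++) {
    long long a = 0;
    /* the word is terminated by a space; skip the newline left by the previous entry */
    for (;;) {
      int c = fgetc(f);
      if (c == EOF || c == ' ') break;
      if (c != '\n' && a < MAX_WORD - 1) vocab[b * MAX_WORD + a++] = (char)c;
    }
    vocab[b * MAX_WORD + a] = 0;
    /* the embedding itself is stored as raw floats */
    if (fread(&M[b * size], sizeof(float), (size_t)size, f) != (size_t)size) break;
  }
  fclose(f);

  printf("loaded %lld words of dimension %lld; first word: %s\n", words, size, vocab);
  free(vocab);
  free(M);
  return 0;
}
```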
`HowNet.txt` is a Chinese knowledge base with annotated word-sense-sememe information.
`Sogou-T(sample).txt` is a sample dataset extracted from Sogou-T.
The complete training dataset Clean-SogouT is released at https://pan.baidu.com/s/1kXgkyJ9 (password: f2ul).
The files `wordsim-240.txt` and `wordsim-297.txt` are used to evaluate the quality of word representations. The file `analogy.txt` is used to evaluate the models' capability of word analogy inference.
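Word similarity evaluation typically scores each word pair by the cosine similarity of its two embeddings and then compares the resulting ranking against the human ratings with Spearman's correlation; the word similarity numbers reported below are of that kind. The helper below is a small illustrative sketch, not part of the released code.

```c
#include <stdio.h>
#include <math.h>

/* Cosine similarity between two embedding vectors of dimension `size`.
   For word similarity evaluation, these scores are ranked against the human
   ratings in wordsim-240.txt / wordsim-297.txt (e.g. with Spearman's rho). */
float cosine_similarity(const float *a, const float *b, long long size) {
  double dot = 0.0, na = 0.0, nb = 0.0;
  for (long long i = 0; i < size; i++) {
    dot += (double)a[i] * b[i];
    na  += (double)a[i] * a[i];
    nb  += (double)b[i] * b[i];
  }
  if (na == 0.0 || nb == 0.0) return 0.0f;  /* guard against all-zero vectors */
  return (float)(dot / (sqrt(na) * sqrt(nb)));
}

int main(void) {
  /* toy 3-dimensional vectors purely for illustration */
  float u[3] = {1.0f, 2.0f, 3.0f};
  float v[3] = {2.0f, 4.0f, 6.0f};
  printf("cosine = %f\n", cosine_similarity(u, v, 3));  /* parallel vectors -> 1.000000 */
  return 0;
}
```

Word analogy inference is typically handled with vector offsets: for a question a : b :: c : ?, the predicted answer is the vocabulary word whose embedding is closest, by the same cosine measure, to vec(b) - vec(a) + vec(c).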
Code annotations are provided for the four files `SSA.c`/`MST.c`/`SAC.c`/`SAT.c`; annotations for the code shared among them are only included in `SSA.c`.
We apologize that bugs were found in the programs; they have now been fixed. The updated experimental results are released on GitHub (see the tables below), and a new version of the paper is available.
Word similarity evaluation:

Model | Wordsim-240 | Wordsim-297 |
---|---|---|
CBOW | 57.7 | 61.1 |
GloVe | 59.8 | 58.7 |
Skip-gram | 58.5 | 63.3 |
SSA | 58.9 | 64.0 |
MST | 59.2 | 62.8 |
SAC | 59.1 | 61.0 |
SAT | 61.2 | 63.3 |
Word analogy evaluation:

Model | Capital | City | Relationship | All |
---|---|---|---|---|
CBOW | 49.8 | 85.7 | 86.0 | 64.2 |
GloVe | 57.3 | 74.3 | 81.6 | 65.8 |
Skip-gram | 66.8 | 93.7 | 76.8 | 73.4 |
SSA | 62.3 | 93.7 | 81.6 | 71.9 |
MST | 65.7 | 95.4 | 82.7 | 74.5 |
SAC | 79.2 | 97.7 | 75.0 | 81.0 |
SAT | 82.6 | 98.9 | 80.1 | 84.5 |