Word2vec is implemented in Python 3 and NumPy. The gradient calculation and gradient-ascent updates are written by hand in Python itself (a sketch of one update step follows the dependency list). It has the following dependencies:
- python3
- numpy
- nltk
- pickle (Python standard library)
- tqdm
- scipy
- operator (Python standard library)
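Since the gradients are derived and applied by hand rather than through an autodiff framework, here is a minimal sketch of one gradient-ascent step for skip-gram with negative sampling. The function and variable names are illustrative; the actual implementation is in word2vec.py:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr):
    """One gradient-ascent step for skip-gram with negative sampling:
    maximizes log sigma(u.v) + sum_k log sigma(-u_k.v) for a single
    (center, context) pair and k negative samples, all updated in place."""
    g_pos = 1.0 - sigmoid(context @ center)  # d/dx log(sigma(x)) = 1 - sigma(x)
    grad_center = g_pos * context            # accumulate before mutating anything
    context += lr * g_pos * center           # pull the true context closer

    for neg in negatives:                    # rows of a 2-D array are views,
        g_neg = -sigmoid(neg @ center)       # so in-place updates stick
        grad_center += g_neg * neg
        neg += lr * g_neg * center           # push each negative sample away

    center += lr * grad_center

# Toy usage: random 50-dimensional vectors and 10 negative samples.
rng = np.random.default_rng(0)
sgns_step(rng.normal(size=50), rng.normal(size=50),
          rng.normal(size=(10, 50)), lr=0.05)
```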
To install the dependencies, run the following command:
pip3 install -r requirement.txt
The Reuters dataset has 7k documents. All of these documents are concatenated into a single document, and the words are lowercased. To perform the data preprocessing, run the following command:
python3 word2vec.py --tokenize True --build_dict True --tokenized_data_file tokenized_data_lemmatized.pkl
If you also want to lemmatize the tokens and remove stop words, run the following command:
python3 word2vec.py --tokenize True --build_dict True --lemmatize True --tokenized_data_file tokenized_data_lemmatized.pkl --remove_stop_words True
The tokenized data will be saved to the file tokenized_data_lemmatized.pkl. You can also apply stemming by passing --stemming True at run time.
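For reference, here is a minimal sketch of the pipeline these flags control, assuming NLTK's tokenizer, English stopword list, and WordNet lemmatizer (the actual pipeline lives in word2vec.py, and corpus.txt is a placeholder for the combined Reuters text):

```python
import pickle
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the 'punkt', 'stopwords', and 'wordnet' NLTK data
# packages (fetch them with nltk.download(...)).

def preprocess(text, lemmatize=True, remove_stop_words=True):
    tokens = word_tokenize(text.lower())              # lowercase, then tokenize
    if remove_stop_words:
        stop = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stop]
    if lemmatize:
        wnl = WordNetLemmatizer()
        tokens = [wnl.lemmatize(t) for t in tokens]
    return tokens

# corpus.txt stands in for the combined Reuters text.
with open("corpus.txt") as f:
    tokens = preprocess(f.read())
with open("tokenized_data_lemmatized.pkl", "wb") as f:
    pickle.dump(tokens, f)
```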
To train the word2vec model, run the following command:
python3 word2vec.py --lemmatize True --embedding_dim 50 --train_embeddings True --nb_negative_samples 10 --lr 0.05 --nb_epochs 14
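To show how these flags fit together, here is a hypothetical, vectorized sketch of a skip-gram negative-sampling training loop. `pairs` (the (center, context) index pairs) and `noise_probs` (the distribution negatives are drawn from) are assumed inputs; the actual loop in word2vec.py may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(pairs, vocab_size, noise_probs, embedding_dim=50,
          nb_negative_samples=10, lr=0.05, nb_epochs=14):
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))  # word vectors
    C = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))  # context vectors
    for _ in range(nb_epochs):
        for center, context in pairs:
            negs = rng.choice(vocab_size, size=nb_negative_samples, p=noise_probs)
            v, u = W[center], C[context]
            U = C[negs]                          # fancy indexing returns a copy
            g_pos = 1.0 - sigmoid(u @ v)         # gradient weight, true pair
            g_neg = -sigmoid(U @ v)              # gradient weights, negatives
            grad_v = g_pos * u + g_neg @ U       # computed before any update
            C[context] += lr * g_pos * v         # ascent step on each parameter
            C[negs]    += lr * np.outer(g_neg, v)
            W[center]  += lr * grad_v
    return W, C

# Toy usage: 5-word vocabulary, a few pairs, uniform noise distribution.
W, C = train([(0, 1), (1, 2), (2, 3)], vocab_size=5,
             noise_probs=np.full(5, 0.2))
```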
See word2vec.py for the full list of options.
To calculate the Pearson correlation coefficient, run the following command:
python3 word2vec.py --pearson_cofficient True --word_model 'checkpoints/word_embeddings_10_50dim_with_lemma_v5.1.2.pkl' --context_model 'checkpoints/context_embedding_10_50dim_with_lemma_v5.1.2.pkl'
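Roughly, this correlates the model's cosine similarities with human similarity ratings over a set of word pairs. A minimal sketch, assuming the checkpoint is a dict mapping words to NumPy vectors; the benchmark pairs below are toy placeholders, not the repository's evaluation data:

```python
import pickle
import numpy as np
from scipy.stats import pearsonr

# Assumed checkpoint format: a dict mapping word -> numpy vector.
with open("checkpoints/word_embeddings_10_50dim_with_lemma_v5.1.2.pkl", "rb") as f:
    emb = pickle.load(f)

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy placeholder for a human-rated word-similarity benchmark.
word_pairs = [("king", "queen", 8.6), ("car", "tree", 1.5)]

model_sims, human_sims = [], []
for w1, w2, rating in word_pairs:
    if w1 in emb and w2 in emb:   # skip pairs outside the vocabulary
        model_sims.append(cosine(emb[w1], emb[w2]))
        human_sims.append(rating)

r, p = pearsonr(model_sims, human_sims)
print(f"Pearson r = {r:.3f}")
```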
To get the k nearest neighbors of a test word, run the following command:
python3 word2vec.py --k_NN "Queen" --lemmatize True --k 10
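Under the hood this amounts to ranking all vocabulary words by cosine similarity to the query vector. A minimal sketch, again assuming a dict-of-vectors checkpoint (path and lookup details are illustrative):

```python
import pickle
import numpy as np

with open("checkpoints/word_embeddings_10_50dim_with_lemma_v5.1.2.pkl", "rb") as f:
    emb = pickle.load(f)   # assumed: dict mapping word -> numpy vector

def k_nearest(query, k=10):
    q = emb[query]
    q = q / np.linalg.norm(q)                  # normalize the query once
    sims = {w: (v @ q) / np.linalg.norm(v)     # cosine similarity to every word
            for w, v in emb.items() if w != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# The preprocessed corpus is lowercased, so query in lowercase.
print(k_nearest("queen", k=10))
```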