Word2vec is implemented in Python 3 and NumPy. The gradient calculation and gradient-ascent updates are written by hand in Python itself (a sketch of one update step follows the dependency list). It has the following dependencies:
- python3
- numpy
- nltk
- pickle (Python standard library)
- tqdm
- scipy
- operator (Python standard library)
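Since the gradients are derived and applied by hand rather than through an autodiff framework, here is a minimal sketch of one gradient-ascent step for skip-gram with negative sampling. The function and variable names are illustrative; the actual implementation is in word2vec.py:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr):
    """One gradient-ascent step for skip-gram with negative sampling:
    maximizes log sigma(u.v) + sum_k log sigma(-u_k.v) for a single
    (center, context) pair and k negative samples, all updated in place."""
    g_pos = 1.0 - sigmoid(context @ center)  # d/dx log(sigma(x)) = 1 - sigma(x)
    grad_center = g_pos * context            # accumulate before mutating anything
    context += lr * g_pos * center           # pull the true context closer

    for neg in negatives:                    # rows of a 2-D array are views,
        g_neg = -sigmoid(neg @ center)       # so in-place updates stick
        grad_center += g_neg * neg
        neg += lr * g_neg * center           # push each negative sample away

    center += lr * grad_center

# Toy usage: random 50-dimensional vectors and 10 negative samples.
rng = np.random.default_rng(0)
sgns_step(rng.normal(size=50), rng.normal(size=50),
          rng.normal(size=(10, 50)), lr=0.05)
```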
To install the dependencies, run the following command:
pip3 install -r requirement.txt
The Reuters dataset has 7k documents. All of these documents are concatenated into a single document, and the words are lowercased. To perform the data preprocessing, run the following command:
python3 word2vec.py --tokenize True --build_dict True --tokenized_data_file tokenized_data_lemmatized.pkl
If you also want to lemmatize the tokens and remove stop words, run the following command:
python3 word2vec.py --tokenize True --build_dict True --lemmatize True --tokenized_data_file tokenized_data_lemmatized.pkl --remove_stop_words True
The tokenized data will be saved to the file tokenized_data_lemmatized.pkl. You can also apply stemming by passing --stemming True at run time.
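For reference, here is a minimal sketch of the pipeline these flags control, assuming NLTK's tokenizer, English stopword list, and WordNet lemmatizer (the actual pipeline lives in word2vec.py, and corpus.txt is a placeholder for the combined Reuters text):

```python
import pickle
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the 'punkt', 'stopwords', and 'wordnet' NLTK data
# packages (fetch them with nltk.download(...)).

def preprocess(text, lemmatize=True, remove_stop_words=True):
    tokens = word_tokenize(text.lower())              # lowercase, then tokenize
    if remove_stop_words:
        stop = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stop]
    if lemmatize:
        wnl = WordNetLemmatizer()
        tokens = [wnl.lemmatize(t) for t in tokens]
    return tokens

# corpus.txt stands in for the combined Reuters text.
with open("corpus.txt") as f:
    tokens = preprocess(f.read())
with open("tokenized_data_lemmatized.pkl", "wb") as f:
    pickle.dump(tokens, f)
```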
To train the word2vec model, run the following command:
python3 word2vec.py --lemmatize True --embedding_dim 50 --train_embeddings True --nb_negative_samples 10 --lr 0.05 --nb_epochs 14
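To show how these flags fit together, here is a hypothetical, vectorized sketch of a skip-gram negative-sampling training loop. `pairs` (the (center, context) index pairs) and `noise_probs` (the distribution negatives are drawn from) are assumed inputs; the actual loop in word2vec.py may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(pairs, vocab_size, noise_probs, embedding_dim=50,
          nb_negative_samples=10, lr=0.05, nb_epochs=14):
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))  # word vectors
    C = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))  # context vectors
    for _ in range(nb_epochs):
        for center, context in pairs:
            negs = rng.choice(vocab_size, size=nb_negative_samples, p=noise_probs)
            v, u = W[center], C[context]
            U = C[negs]                          # fancy indexing returns a copy
            g_pos = 1.0 - sigmoid(u @ v)         # gradient weight, true pair
            g_neg = -sigmoid(U @ v)              # gradient weights, negatives
            grad_v = g_pos * u + g_neg @ U       # computed before any update
            C[context] += lr * g_pos * v         # ascent step on each parameter
            C[negs]    += lr * np.outer(g_neg, v)
            W[center]  += lr * grad_v
    return W, C

# Toy usage: 5-word vocabulary, a few pairs, uniform noise distribution.
W, C = train([(0, 1), (1, 2), (2, 3)], vocab_size=5,
             noise_probs=np.full(5, 0.2))
```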
See word2vec.py for the full list of options.
To calculate the Pearson correlation coefficient, run the following command:
python3 word2vec.py --pearson_cofficient True --word_model 'checkpoints/word_embeddings_10_50dim_with_lemma_v5.1.2.pkl' --context_model 'checkpoints/context_embedding_10_50dim_with_lemma_v5.1.2.pkl'
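Roughly, this correlates the model's cosine similarities with human similarity ratings over a set of word pairs. A minimal sketch, assuming the checkpoint is a dict mapping words to NumPy vectors; the benchmark pairs below are toy placeholders, not the repository's evaluation data:

```python
import pickle
import numpy as np
from scipy.stats import pearsonr

# Assumed checkpoint format: a dict mapping word -> numpy vector.
with open("checkpoints/word_embeddings_10_50dim_with_lemma_v5.1.2.pkl", "rb") as f:
    emb = pickle.load(f)

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy placeholder for a human-rated word-similarity benchmark.
word_pairs = [("king", "queen", 8.6), ("car", "tree", 1.5)]

model_sims, human_sims = [], []
for w1, w2, rating in word_pairs:
    if w1 in emb and w2 in emb:   # skip pairs outside the vocabulary
        model_sims.append(cosine(emb[w1], emb[w2]))
        human_sims.append(rating)

r, p = pearsonr(model_sims, human_sims)
print(f"Pearson r = {r:.3f}")
```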
To get the k nearest neighbors of a test word, run the following command:
python3 word2vec.py --k_NN "Queen" --lemmatize True --k 10
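Under the hood this amounts to ranking all vocabulary words by cosine similarity to the query vector. A minimal sketch, again assuming a dict-of-vectors checkpoint (path and lookup details are illustrative):

```python
import pickle
import numpy as np

with open("checkpoints/word_embeddings_10_50dim_with_lemma_v5.1.2.pkl", "rb") as f:
    emb = pickle.load(f)   # assumed: dict mapping word -> numpy vector

def k_nearest(query, k=10):
    q = emb[query]
    q = q / np.linalg.norm(q)                  # normalize the query once
    sims = {w: (v @ q) / np.linalg.norm(v)     # cosine similarity to every word
            for w, v in emb.items() if w != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# The preprocessed corpus is lowercased, so query in lowercase.
print(k_nearest("queen", k=10))
```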