Skip to content

IISCAditayTripathi/NLU_word2vec

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Word2Vec implementation on Reuters corpus

Word2vec is implemented in python3 and numpy. Gradient calculation and gradient ascent is written in python itself. It has following dependencies:

  • python3
  • numpy
  • nltk
  • pickle
  • tqdm
  • scipy
  • operator

To Install Depedencies run the following command:

pip3 install -r requirement.txt

Data Preprocessing

The Reuters dataset has 7k documents. All these documents are combined together to make a single document. The words in the document are lower cased. To perform data preprocessing run the following command:

python3 word2vec.py --tokenize True --build_dict True --tokenized_data_file tokenized_data_lemmatized.pkl

If you want to add lemmatization and remove the stopwords run the following command:

python3 word2vec.py --tokenize True --build_dict True --lemmatize True --tokenized_data_file tokenized_data_lemmatized.pkl --remove_stop_words True

The tokenized data will be saved in the file tokenized_data_lemmatized.pkl . You can also do stemming by switching --stemming True at run time.

Word2vec Training

To train the word2vec model run the following command:

python3 word2vec.py --lemmatize True --embedding_dim 50 --train_embeddings True --nb_negative_samples 10 --lr 0.05 --nb_epochs 14

Read word2vec.py to learn about more options.

Pearson cofficient

To calculate Pearson cofficient run the following command:

python3 word2vec.py --pearson_cofficient True --word_model 'checkpoints/word_embeddings_10_50dim_with_lemma_v5.1.2.pkl' --context_model 'checkpoints/context_embedding_10_50dim_with_lemma_v5.1.2.pkl'

K-Neighbors

To get the K-nnearest neighbors of the test word, run the following code:

python3 word2vec.py --k_NN "Queen" --lemmatize True --k 10

About

Word2vec implementation in python and numpy.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages