forked from sordonia/hed-dlg

Hierarchical Encoder Decoder for Dialog Modelling

julianser/hed-dlg

 
 

hed-dlg

Creating A Dataset

The script convert-text2dict.py can be used to generate model datasets based on text files with dialogues. It is assumed that each dialogue consists of three turns: A-B-A.

Prepare your dataset as a text file with one dialogue (one triple) per line. Each dialogue must contain exactly three utterances, separated by tab characters. There must not be any tab characters elsewhere in the file. The dialogues are assumed to be tokenized. If you have validation and test sets, they must satisfy the same requirements.
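The format requirements above can be checked before running the conversion script. The following is a minimal sketch, not part of the repository: the function name check_triples_file and the sample lines are made up for illustration, and treating an empty utterance as an error is an extra assumption.

```python
def check_triples_file(lines):
    """Return (line_number, problem) pairs for lines that are not valid triples."""
    problems = []
    for number, line in enumerate(lines, start=1):
        # Each line should hold exactly three tab-separated utterances.
        utterances = line.rstrip("\n").split("\t")
        if len(utterances) != 3:
            problems.append(
                (number, "expected 3 utterances, found %d" % len(utterances))
            )
        elif any(not u.strip() for u in utterances):
            # Assumption: an empty utterance is treated as a mistake.
            problems.append((number, "empty utterance"))
    return problems


# Illustrative data: the second dialogue has only two utterances.
sample = [
    "how are you ?\ti am fine .\tgood to hear .\n",
    "hello there\tonly two utterances\n",
]
print(check_triples_file(sample))
```

To check a real file, pass an open file object instead of the sample list, for example check_triples_file(open("train.txt")) where train.txt stands for your own training file.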

Once you're ready, you can create the model dataset files by running:

python convert-text2dict.py <training_file> --cutoff <vocabulary_size> Training
python convert-text2dict.py <validation_file> --dict=Training.dict.pkl Validation
python convert-text2dict.py <test_file> --dict=Training.dict.pkl <vocabulary_size> Test

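where <training_file> is the training file, and <vocabulary_size> is the number of tokens that you want to train on (all other tokens will be converted to out-of-vocabulary symbols). The validation and test commands pass the dictionary from the training run (Training.dict.pkl) via --dict, so all three sets share one vocabulary.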
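The vocabulary cutoff described above can be pictured roughly as follows. This is a sketch of the general technique only, not the code in convert-text2dict.py: the symbol name "<unk>", the function names, and the choice to keep the most frequent tokens are all assumptions made for illustration.

```python
from collections import Counter

UNK = "<unk>"  # assumed name for the out-of-vocabulary symbol


def build_vocab(dialogues, cutoff):
    """Keep the `cutoff` most frequent tokens across all dialogues."""
    counts = Counter(
        token
        for dialogue in dialogues
        for token in dialogue.replace("\t", " ").split()
    )
    return {token for token, _ in counts.most_common(cutoff)}


def apply_vocab(dialogue, vocab):
    """Replace every token outside `vocab` with UNK, preserving the tabs."""
    return "\t".join(
        " ".join(tok if tok in vocab else UNK for tok in utterance.split())
        for utterance in dialogue.split("\t")
    )


# Illustrative triple: "a" occurs 4 times, "b" twice, "c" once.
training = ["a b a\tb a\ta c"]
vocab = build_vocab(training, cutoff=2)
print(apply_vocab(training[0], vocab))
```

Building the vocabulary once from the training data and reusing it for the validation and test data mirrors how the commands above pass Training.dict.pkl through --dict.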

Training The Model

If you have the bleeding-edge version of Theano installed with GPU support, you can train the model as follows:

  1. Clone the GitHub repository.
  2. Create new "Output" and "Data" directories inside it.
  3. Unpack your dataset files into the "Data" directory.
  4. Create a new prototype inside state.py (see prototype_moviedic or prototype_test for examples).
  5. From the terminal, cd into the code directory and run:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python train.py --prototype <prototype_name> &> Model_Output.txt

For a 13M word dataset, such as MovieTriples, training takes about 1-2 days to converge.

To test the model afterwards, you can run:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python evaluate.py --exclude-sos --plot-graphs Output/<model_name> --document_ids Data/Test_Shuffled_Dataset_Labels.txt &> Model_Evaluation.txt

where <model_name> is the name automatically generated by train.py.

If your GPU runs out of memory, you can reduce the bs (batch size) parameter inside state.py, at the cost of slower training. You can also experiment with the other parameters inside state.py.
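As a rough illustration of lowering bs, the sketch below assumes that a prototype is a function returning a dictionary of settings. Only the bs key comes from this README; the helper default_state, the prototype name prototype_small_batch, and the value 80 are placeholders rather than the contents of the real state.py, so check prototype_moviedic or prototype_test for the actual structure.

```python
def default_state():
    # Stand-in for the shared defaults; the real state.py defines many more
    # settings, and its actual batch size may differ from this placeholder.
    state = {}
    state["bs"] = 80
    return state


def prototype_small_batch():
    # Hypothetical prototype: start from the defaults and halve the batch
    # size so that each training step needs less GPU memory.
    state = default_state()
    state["bs"] = state["bs"] // 2
    return state


print(prototype_small_batch()["bs"])
```

Under these assumptions, the new prototype would be selected by passing its name to train.py through --prototype, as in the training command above.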
