TGen Baseline for the E2E NLG Challenge

To train and evaluate the baseline, you need to:

Convert the downloaded E2E data into a format used by TGen. This is done using the input/convert.py script.

Note that multiple references are joined for one MR in the development set, but kept separate for the training set (the -m switch). All files are plain text, one instance per line (except for multiple references, where instances are separated by empty lines).
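The multi-reference layout described above (one reference per line, instances separated by empty lines) can be read back with a few lines of Python. This is a minimal sketch, not part of TGen: the function name `read_instances` is hypothetical, and it only applies to files produced with the -m switch, since files without it have one instance per line and no separators.

```python
from typing import Iterable, List


def read_instances(lines: Iterable[str]) -> List[List[str]]:
    """Group lines into instances; an empty line closes the current instance."""
    instances: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # Empty line: the references collected so far form one instance.
            instances.append(current)
            current = []
    if current:
        # The last instance may not be followed by an empty line.
        instances.append(current)
    return instances


sample = "ref 1a\nref 1b\n\nref 2a\n".splitlines()
print(read_instances(sample))  # [['ref 1a', 'ref 1b'], ['ref 2a']]
```

To use it on a converted file, pass an open file handle, e.g. `read_instances(open("devel-conc.txt", encoding="utf-8"))`.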

The name and near slots in the MRs are delexicalized. The output files are:
- *-abst.txt -- lexicalization instructions (what was delexicalized at which position in the references, can be used to lexicalize the outputs)
- *-conc_das.txt -- original, lexicalized MRs (converted to TGen's representation, semantically equivalent)
- *-conc.txt -- original, lexicalized reference texts
- *-das.txt -- delexicalized MRs
- *-text.txt -- delexicalized reference texts
If you're using a test set file that only contains MRs and not references, use the --no-refs switch. This won't produce *-text.txt and *-conc.txt (since they wouldn't make sense anyway). You'll still have the *-das.txt and *-abst.txt files required for generation.
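To illustrate what delexicalization and the later lexicalization step do conceptually, here is a toy sketch. It is not the code in convert.py: the placeholder tokens (`X-name`, `X-near`), the function names, and the in-memory record are all assumptions made for illustration, and the real *-abst.txt files also store positions, which this sketch omits.

```python
from typing import Dict, List, Tuple


def delexicalize(text: str, slots: Dict[str, str]) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace each slot value found in the text with a placeholder token.

    Returns the delexicalized text and a record of (placeholder, value)
    pairs that is enough to undo the replacement in this toy setting.
    """
    record: List[Tuple[str, str]] = []
    for slot, value in slots.items():
        placeholder = "X-" + slot  # assumed placeholder format
        if value in text:
            text = text.replace(value, placeholder)
            record.append((placeholder, value))
    return text, record


def lexicalize(text: str, record: List[Tuple[str, str]]) -> str:
    """Put the original slot values back in place of the placeholders."""
    for placeholder, value in record:
        text = text.replace(placeholder, value)
    return text


reference = "The Eagle is a coffee shop near Burger King ."
delex, record = delexicalize(reference, {"name": "The Eagle", "near": "Burger King"})
print(delex)                      # X-name is a coffee shop near X-near .
print(lexicalize(delex, record))  # The Eagle is a coffee shop near Burger King .
```

The generator is trained on the delexicalized side, so it never has to learn individual restaurant names; the record is what allows the names to be restored in the generated outputs.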

./convert.py -a name,near -n trainset.csv train
./convert.py -a name,near -n -m devset.csv devel
./convert.py -a name,near -n -m testset_with_refs.csv test
./convert.py -a name,near -n --no-refs testset_without_refs.csv test

Train TGen on the training set. This uses the default configuration file, the converted data, and the default random seed. It will save the model into model.pickle.gz (and several other files starting with model). Note that we used five different random seeds (-r s0, -r s1 ... -r s4) and then picked the setup that performed best on the development data.

../run_tgen.py seq2seq_train config/config.yaml \
    input/train-das.txt input/train-text.txt \
    model.pickle.gz

The default configuration uses a small part of the training data for validation (early stopping if the performance goes down on that set). You can also opt to use the development set for validation (in that case, the validation set isn't “unseen” for the purposes of evaluation). If you want to use the development set during training, add -v input/devel-das.txt,input/devel-text.txt to the parameters (right after seq2seq_train).
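If you want to repeat the multi-seed setup, the training commands can be assembled programmatically. The sketch below only builds and prints the command lines; it does not run them. Two things in it are assumptions rather than documented behavior: that -r is accepted in the same position as -v (right after seq2seq_train), and the per-seed model file names (`model-s0.pickle.gz` etc.), which are hypothetical and chosen only so that runs do not overwrite each other.

```python
from typing import List, Optional


def train_command(seed: str, validation: Optional[str] = None) -> List[str]:
    """Build one seq2seq_train command line as an argument list."""
    cmd = ["../run_tgen.py", "seq2seq_train"]
    if validation:
        # e.g. "input/devel-das.txt,input/devel-text.txt"
        cmd += ["-v", validation]
    # Assumption: -r goes among the options right after seq2seq_train.
    cmd += ["-r", seed]
    cmd += [
        "config/config.yaml",
        "input/train-das.txt",
        "input/train-text.txt",
        f"model-{seed}.pickle.gz",  # hypothetical per-seed model name
    ]
    return cmd


for i in range(5):
    print(" ".join(train_command(f"s{i}")))
```

Each printed line can be pasted into a shell, or the argument list can be passed to `subprocess.run` once you have checked the option order against your TGen version.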

Generate outputs on the development and test sets. This will also perform lexicalization of the outputs.

../run_tgen.py seq2seq_gen -w outputs-dev.txt -a input/devel-abst.txt \
    model.pickle.gz input/devel-das.txt
../run_tgen.py seq2seq_gen -w outputs-test.txt -a input/test-abst.txt \
    model.pickle.gz input/test-das.txt

Postprocess the outputs. This amounts to simple detokenization. By default the script modifies the output file in place; alternatively, you can specify a target file name.

./postprocess/postprocess.py outputs-dev.txt
./postprocess/postprocess.py outputs-test.txt
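For readers unfamiliar with detokenization, the following sketch shows the general idea on a single line of text. It is not the logic of postprocess.py: the function name and the specific rules (collapsing whitespace, attaching punctuation to the preceding word, capitalizing the first letter) are assumptions chosen to illustrate the technique.

```python
import re


def detokenize(text: str) -> str:
    """Turn a space-separated token sequence into more natural-looking text."""
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the space that tokenization leaves before punctuation marks.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # Capitalize the first character, leaving the rest untouched.
    return text[:1].upper() + text[1:]


print(detokenize("the eagle is a pub , near the river ."))
```

A real postprocessing script would also need to handle file input and output and any tokenization quirks specific to the generator.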

Remarks

Please refer to ../USAGE.md for TGen installation instructions.

The Makefile in this directory contains a simple experiment management system, but it assumes an SGE computing cluster and probably has site-specific settings hardcoded. Please contact me if you want to use it.
