TGen is written in Python (versions 3.6 and 3.7 are supported). You can install it simply by cloning this repository and then installing all Python dependencies using pip:
```
git clone https://github.com/UFAL-DSG/tgen
cd tgen
pip install -r requirements.txt
```
We recommend using virtualenv to install all the required libraries. If your system Python version differs from 3.6/3.7, you can use Miniconda to get a minimal Python 3.7 environment.
To replicate most of the experiments in our papers, you will also need to install Treex (including the newest version from the Git repository as described in Step 5 of the Treex installation guide). It is, however, not needed for basic TGen functionality (without using deep syntactic trees).
Required Python modules (installed using pip and the requirements file):
- enum34
- numpy
- rpyc
- pudb
- recordclass
- tensorflow (only version 1.13.1 is supported)
- kenlm
- PyTreex
Optional, manual installation (Perl code):
- Treex
Additionally, some obsolete code depends on Theano; it is currently unused and will probably be removed in the future.
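Since only Python 3.6-3.7 and TensorFlow 1.13.1 are supported, it may be worth sanity-checking the environment before training. The snippet below is a small illustrative check, not part of TGen itself:

```python
# Illustrative environment check (not part of TGen):
# TGen supports Python 3.6-3.7 and TensorFlow 1.13.1 only.
import sys
import tensorflow as tf

assert sys.version_info[:2] in ((3, 6), (3, 7)), sys.version
assert tf.__version__ == '1.13.1', tf.__version__
print('Environment looks OK')
```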
Parallel training on the cluster uses SGE's `qsub`.
The main entry point is `run_tgen.py`. The basic commands used for training and generation are:
```
./run_tgen.py seq2seq_train config-file.py train-das.txt train-text.txt model.pickle.gz
./run_tgen.py seq2seq_gen [-w out-text.txt] model.pickle.gz test-das.txt
```
You can run the program with `seq2seq_train -h` and `seq2seq_gen -h` to see more detailed options.
The file parameters for training are:

- `config-file.py` -- a configuration file containing a Python dictionary with all generator parameters. A default configuration file can be found in every experiment directory (see below).
- `train-das.txt` -- training DAs, one DA per line (see below).
- `train-text.txt` -- training natural language texts or trees (in a Treex YAML file) as example outputs for the generator (see below). Text files should contain one instance per line.
- `model.pickle.gz` -- the output destination for the model. Note that several additional files with different extensions will be created.
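For illustration, the two training files are aligned line by line. The example below is hypothetical: the slot names and the `X-name` delexicalization placeholder are just one possible convention and depend on the dataset:

```
train-das.txt, line 1:   inform(name=X-name,food=Chinese,price=expensive)
train-text.txt, line 1:  X-name is an expensive restaurant serving Chinese food .
```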
The generation mode requires the model and a list of DAs, one per line. It can write the outputs into a text file (for direct string generation) or a Treex YAML file (for tree generation). The outputs are typically post-processed further (lexicalization, tree-to-string surface realization).
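As an example of such post-processing, lexicalization after string generation amounts to substituting placeholder tokens with the original slot values. The sketch below is purely illustrative; the `relexicalize` helper and the `X-` placeholder convention are assumptions, not TGen's own code:

```python
# Illustrative post-processing sketch (not TGen's own code):
# fill delexicalized placeholders such as 'X-name' back in,
# using values taken from the original, non-delexicalized DA.
def relexicalize(sentence, values):
    """Replace placeholder tokens with their concrete values."""
    return ' '.join(values.get(tok, tok) for tok in sentence.split())

print(relexicalize('X-name serves X-food food .',
                   {'X-name': 'Golden Dragon', 'X-food': 'Chinese'}))
# -> Golden Dragon serves Chinese food .
```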
The main data formats used by TGen are:

- **Dialogue Acts (DAs)**: the main input format to TGen is a list of triples of the shape (DA type, slot/attribute, value), e.g. `inform(food=Chinese)&inform(price=expensive)`. This maps easily onto the dialogue act representations used in various spoken dialogue systems. Conversion scripts are provided for several datasets (see below). In the typical case, DAs are delexicalized; a parsing sketch follows this list.
- **Plain text**: outputs for direct string generation. Use one output sentence per line (no comments or empty lines allowed). For best results, delexicalize sparse values, such as restaurant/landmark names, time values etc., and fill them in during a postprocessing step.
- **Trees**: trees used for DA-to-tree generation are t-trees as produced by the Treex NLP system. We use the YAML serialization produced by the `Write::YAML` Treex block. Installing Treex is necessary for any experiments involving generating trees.
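To make the triple structure concrete, here is a minimal sketch of reading the DA string format shown above into (DA type, slot, value) triples. It is illustrative only, not TGen's actual parser (which also handles quoting and special characters in values):

```python
# Illustrative reader for the DA format above (not TGen's parser), e.g.
#   inform(food=Chinese)&inform(price=expensive)
# -> [('inform', 'food', 'Chinese'), ('inform', 'price', 'expensive')]
import re

def parse_da(da_string):
    triples = []
    for part in da_string.split('&'):
        da_type, slot_values = re.match(r'([a-z_?]+)\(([^)]*)\)', part).groups()
        if not slot_values:  # DA types may come without slots/values
            triples.append((da_type, None, None))
            continue
        for slot_value in slot_values.split(','):
            slot, _, value = slot_value.partition('=')
            triples.append((da_type, slot, value or None))
    return triples

print(parse_da('inform(food=Chinese)&inform(price=expensive)'))
```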
Our own experiments on several datasets are included as subdirectories within this repository:

- `alex-context/`: our experiments on the Alex Context NLG Dataset (SIGDIAL 2016).
- `bagel-data/`: our experiments on the BAGEL dataset (ACL 2015, 2016).
- `cs-restaurant/`: generation for our Czech Restaurant NLG dataset.
- `e2e-challenge/`: the baseline system for the E2E NLG Challenge. More detailed usage instructions can be found directly in the experiment subdirectory; these also partially apply to the other datasets.
- `sfx-restaurant/`: generation from the San Francisco Restaurants dataset collected by Wen et al., EMNLP 2015. Some specific usage instructions can be found directly in the experiment subdirectory.
To start your experiments, you need to download the dataset into the `input/` subdirectory. From there, a conversion script (`convert.py` in most cases; the Perl script `convert.pl` for BAGEL) can create the data formats required by TGen. Settings used in our experiments are preset in the `Makefile`, which, however, may contain site-specific code and need some tweaking to get working.
The default configuration file for each dataset is stored in the `config/seq2seq.py` file. This is typically the baseline variant; improved versions require slight configuration changes.
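For orientation, a configuration file is simply a Python module defining a dictionary of generator parameters. The sketch below is hypothetical: the keys and values shown are illustrative assumptions, so consult the actual `config/seq2seq.py` in each experiment directory for the supported parameters:

```python
# Hypothetical config-file.py sketch; the keys and values here are
# illustrative assumptions -- see the dataset's config/seq2seq.py
# for the actually supported parameters.
config = {
    'use_tokens': True,       # generate token strings rather than trees
    'emb_size': 50,           # embedding dimensionality
    'batch_size': 20,
    'passes': 100,            # number of training passes over the data
    'alpha': 5e-4,            # initial learning rate
    'validation_size': 200,   # instances held out for validation
}
```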
The main experiment directory always contains basic experiment management in the `Makefile`, where `make help` lists the main commands. Note that some of the code in the Makefiles is also site-specific, especially all parts related to computing grid batch job submission, and requires some tweaking to get working. The code in the `Makefile` also assumes that Treex is installed.
If you need help running some of the experiments, feel free to contact me.