TGen -- Installation and Usage

Installation

TGen is written in Python (versions 3.6 and 3.7 are supported). You can install it by cloning this repository and then installing all Python dependencies using pip:

git clone https://github.com/UFAL-DSG/tgen
cd tgen
pip install -r requirements.txt

We recommend using virtualenv to install all the required libraries. If your system Python version is different from 3.6-3.7, you can use Miniconda to get a minimal Python 3.7 environment.

To replicate most of the experiments in our papers, you will also need to install Treex (including the newest version from the Git repository as described in Step 5 of the Treex installation guide). It is, however, not needed for basic TGen functionality (without using deep syntactic trees).

Dependencies

Required Python modules (installed using pip and the requirements file):

Optional, manual installation (Perl code):

Additionally, some obsolete code depends on Theano, but it is currently not used and will probably be removed in the future.

Parallel training on a cluster uses SGE's qsub.

Usage (seq2seq-based generator only)

The main entry point is run_tgen.py. The basic commands used for training and generating are:

./run_tgen.py seq2seq_train config-file.py train-das.txt train-text.txt model.pickle.gz
./run_tgen.py seq2seq_gen [-w out-text.txt] model.pickle.gz test-das.txt

You can run the program with seq2seq_train -h and seq2seq_gen -h to see more detailed options.

The file parameters for training are:

  • config-file.py -- a configuration file, containing a Python dictionary with all generator parameters. A default configuration file can be found in every experiment directory (see below).

  • train-das.txt -- training DAs, one DA per line (see below).

  • train-text.txt -- training natural language texts or trees (in a Treex YAML file) as example outputs for the generator (see below). Text files should contain one instance per line.

  • model.pickle.gz -- the output destination for the model. Note that several additional files with different extensions will be created.
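
Before training, it can help to verify that the two training files line up. The following is a minimal sketch, not part of TGen, assuming that line i of the DA file corresponds to line i of a plain-text output file; the helper names read_instances and check_parallel are hypothetical.

```python
from pathlib import Path


def read_instances(path):
    """Return the lines of a one-instance-per-line file, without newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def check_parallel(das_path, text_path):
    """Check that DA and text files align line by line.

    Assumption (not confirmed by the TGen docs): the i-th DA belongs to the
    i-th text line. Returns the number of training instances.
    """
    das = read_instances(das_path)
    texts = read_instances(text_path)
    if len(das) != len(texts):
        raise ValueError(f"{len(das)} DAs but {len(texts)} texts")
    for line_no, (da, text) in enumerate(zip(das, texts), start=1):
        # Empty lines would silently shift the alignment, so reject them.
        if not da.strip() or not text.strip():
            raise ValueError(f"empty instance at line {line_no}")
    return len(das)
```

Calling check_parallel("train-das.txt", "train-text.txt") returns the instance count or raises ValueError on a mismatch. The check applies only to plain-text training data, not to Treex YAML files.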

The generation mode requires the model and a list of DAs, one per line. It can write the outputs into a text file (for direct string generation) or a Treex YAML file (for tree generation). The files are typically further post-processed (lexicalization, tree-to-string surface realization).
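
If you drive training and generation from a Python script, the two commands above can be assembled as argument lists. This is only a sketch: train_command and gen_command are hypothetical helpers, and they use nothing beyond the positional arguments and the -w option shown above.

```python
def train_command(config, das, texts, model):
    """Build the argument list for the seq2seq_train command."""
    return ["./run_tgen.py", "seq2seq_train", config, das, texts, model]


def gen_command(model, das, out_text=None):
    """Build the argument list for the seq2seq_gen command.

    The -w option is added only when an output text file is requested.
    """
    cmd = ["./run_tgen.py", "seq2seq_gen"]
    if out_text is not None:
        cmd += ["-w", out_text]
    return cmd + [model, das]
```

Each list can be passed to subprocess.run(cmd, check=True) from the repository root, so that a failed training run stops the script before generation starts.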

Data formats

The main data formats used by TGen are:

  • Dialogue Acts (DAs): The main input to TGen is a list of triples of the form (DA type, slot/attribute, value), e.g.: inform(food=Chinese)&inform(price=expensive). This maps easily onto the dialogue act representations used in various spoken dialogue systems. Conversion scripts are provided for several datasets (see below). DAs are typically delexicalized.

  • Plain text: Outputs for direct string generation. Use one output sentence per line (no comments or empty lines allowed). For best results, delexicalize sparse values, such as restaurant/landmark names or time values, and fill them back in during a postprocessing step.

  • Trees: Trees used for DA-to-tree generation are t-trees as produced by the Treex NLP system. We use YAML serialization produced by the Write::YAML Treex block. Installing Treex is necessary for any experiments involving generating trees.
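
To make the DA format and delexicalization concrete, here is a minimal sketch that splits a DA string into triples and replaces sparse values in a text with placeholders. It is not TGen's own parser; the function names and the X-slot placeholder format are assumptions made for illustration, and values containing "&" or ")" are not handled.

```python
import re

# One DA item: type(slot=value), type(slot) or type()
DAI_PATTERN = re.compile(r"^(\w+)\(([^=)]*)(?:=([^)]*))?\)$")


def parse_da(da_string):
    """Split a DA string into (DA type, slot, value) triples."""
    triples = []
    for item in da_string.strip().split("&"):
        match = DAI_PATTERN.match(item.strip())
        if match is None:
            raise ValueError(f"cannot parse DA item: {item!r}")
        da_type, slot, value = match.groups()
        # Missing slots and values are represented as None.
        triples.append((da_type, slot or None, value))
    return triples


def delexicalize(text, triples, sparse_slots):
    """Replace values of the given slots with a placeholder.

    The "X-<slot>" placeholder is a hypothetical convention for this sketch.
    """
    for _, slot, value in triples:
        if slot in sparse_slots and value:
            text = text.replace(value, f"X-{slot}")
    return text


triples = parse_da("inform(food=Chinese)&inform(price=expensive)")
print(triples)
```

Only slots listed in sparse_slots are replaced, so frequent values such as food types stay in the text while rare ones such as names become placeholders that a postprocessing step can fill back in.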

Experiments

Our own experiments on several datasets are included as subdirectories within this repository:

You need to download the dataset into the input/ subdirectory to start your experiments. From there, a conversion script (usually convert.py; for BAGEL, a Perl script convert.pl) can create the data formats required by TGen. The settings used in our experiments are preset in the Makefile, which may, however, contain site-specific code and need some tweaking to work.

The default configuration file for each dataset is stored in the config/seq2seq.py file. This is typically the baseline variant, with improved versions requiring slight configuration changes.

The main experiment directory always has basic experiment management in the Makefile, where make help lists the main commands. Note that some of the code in the Makefiles is also site-specific, especially the parts related to batch job submission on a computing grid, and requires some tweaking to work. The code in the Makefile also assumes that Treex is installed.

If you need help running some of the experiments, feel free to contact me.