Format

The model needs a base conversational corpus (base, e.g. Reddit) as well as a stylized corpus (bias, e.g. arXiv). The base corpus must be conversational (hence base_conv); the bias corpus doesn't have to be, so we only need bias_nonc (nonc means non-conversational).

To sum up, at least the following files are required: base_conv_XXX.num, bias_nonc_XXX.num, and vocab.txt, where XXX is train, vali, or test. See more discussion here

  • vocab.txt is the vocabulary file, one token per line.

    • The first three tokens must be _SOS_, _EOS_ and _UNK_, which represent "start of sentence", "end of sentence", and "unknown token", respectively.
    • The line ID of vocab.txt (starting from 1; 0 is reserved for padding) is the token index used in the *.num files. For example, an unknown token is represented by 3, the token index of _UNK_.
  • *.num files contain sentences encoded as sequences of token indices (see the sketch after this list),

    • for conv, each line is src \t tgt, where \t is the tab delimiter;
    • for nonc, each line is a single sentence.
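
For concreteness, here is a minimal Python sketch (not part of this repo) of the encoding rules above. The whitespace tokenization and space-separated index output are illustrative assumptions; the repo's text2num function is the authoritative converter.

```python
# Minimal illustration of the *.num encoding described above:
# token indices are 1-based line numbers in vocab.txt, 0 is padding,
# and out-of-vocab tokens map to _UNK_ (index 3).

# A tiny stand-in for vocab.txt; the first three tokens are mandatory.
vocab_lines = ["_SOS_", "_EOS_", "_UNK_", "how", "are", "you", "i", "am"]
vocab = {tok: i for i, tok in enumerate(vocab_lines, start=1)}

def encode(sentence):
    """Whitespace-tokenized sentence -> space-separated token indices."""
    unk = vocab["_UNK_"]  # index 3 by construction
    return " ".join(str(vocab.get(tok, unk)) for tok in sentence.split())

# A base_conv line pairs source and target with a tab:
print(encode("how are you") + "\t" + encode("i am fine"))
# -> 4 5 6\t7 8 3   ("fine" is out-of-vocab, so it becomes 3)

# A bias_nonc line is a single encoded sentence:
print(encode("you are how"))  # -> 6 5 4
```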

You may build a vocabulary with the build_vocab function to generate vocab.txt, and then convert raw text files to *.num (e.g. train.txt to train.num) with the text2num function.
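
A hedged sketch of that preprocessing flow follows. build_vocab and text2num are this repo's functions, but the import path and argument forms shown are assumptions for illustration; check the function definitions for the exact signatures.

```python
# Illustrative preprocessing flow only. build_vocab and text2num exist in
# this repo, but the import path and arguments below are assumptions --
# consult the function definitions for the real signatures.
from dataset import build_vocab, text2num  # module name is an assumption

# Step 1: scan the raw text and write vocab.txt, whose first three
# tokens must be _SOS_, _EOS_ and _UNK_.
build_vocab("data/base_conv_train.txt")    # argument form is hypothetical

# Step 2: convert each raw text file into its *.num counterpart,
# e.g. base_conv_train.txt -> base_conv_train.num.
text2num("data/base_conv_train.txt")       # argument form is hypothetical
```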

Dataset

In our paper, we trained the model using the following three datasets.

  • Reddit: the conversational dataset (base_conv), which can be generated using this script.
  • Sherlock Holmes: one of the style datasets (bias_nonc), available here.
  • arXiv: another style corpus (bias_nonc), which can be obtained by following the instructions here.
  • A toy dataset is provided as an example following the format described above.