Format

The model needs a base conversational corpus (base, e.g. Reddit) as well as a stylized corpus (bias, e.g. arXiv). The base corpus must be conversational (hence base_conv); the bias corpus doesn't have to be, so we only need bias_nonc (nonc means non-conversational).

To sum up, at least the following files are required: base_conv_XXX.num, bias_nonc_XXX.num, and vocab.txt, where XXX is train, vali, or test. See more discussion here

  • vocab.txt is the vocabulary file, one token per line.

    • The first three tokens must be _SOS_, _EOS_ and _UNK_, which represent "start of sentence", "end of sentence", and "unknown token", respectively.
    • The line ID of vocab.txt (starting from 1; 0 is reserved for padding) is the token index used in the *.num files. For example, an unknown token is represented by 3, the token index of _UNK_.
  • *.num files contain sentences encoded as sequences of token indices (see the sketch after this list),

    • for conv, each line is src \t tgt, where \t is the tab delimiter;
    • for nonc, each line is a single sentence.
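
For concreteness, here is a minimal Python sketch (not part of this repo) of the encoding rules above. The whitespace tokenization and space-separated index output are illustrative assumptions; the repo's text2num function is the authoritative converter.

```python
# Minimal illustration of the *.num encoding described above:
# token indices are 1-based line numbers in vocab.txt, 0 is padding,
# and out-of-vocab tokens map to _UNK_ (index 3).

# A tiny stand-in for vocab.txt; the first three tokens are mandatory.
vocab_lines = ["_SOS_", "_EOS_", "_UNK_", "how", "are", "you", "i", "am"]
vocab = {tok: i for i, tok in enumerate(vocab_lines, start=1)}

def encode(sentence):
    """Whitespace-tokenized sentence -> space-separated token indices."""
    unk = vocab["_UNK_"]  # index 3 by construction
    return " ".join(str(vocab.get(tok, unk)) for tok in sentence.split())

# A base_conv line pairs source and target with a tab:
print(encode("how are you") + "\t" + encode("i am fine"))
# -> 4 5 6\t7 8 3   ("fine" is out-of-vocab, so it becomes 3)

# A bias_nonc line is a single encoded sentence:
print(encode("you are how"))  # -> 6 5 4
```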

You may build a vocabulary with the build_vocab function to generate vocab.txt, and then convert raw text files to *.num (e.g. train.txt to train.num) with the text2num function.
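
A hedged sketch of that preprocessing flow follows. build_vocab and text2num are this repo's functions, but the import path and argument forms shown are assumptions for illustration; check the function definitions for the exact signatures.

```python
# Illustrative preprocessing flow only. build_vocab and text2num exist in
# this repo, but the import path and arguments below are assumptions --
# consult the function definitions for the real signatures.
from dataset import build_vocab, text2num  # module name is an assumption

# Step 1: scan the raw text and write vocab.txt, whose first three
# tokens must be _SOS_, _EOS_ and _UNK_.
build_vocab("data/base_conv_train.txt")    # argument form is hypothetical

# Step 2: convert each raw text file into its *.num counterpart,
# e.g. base_conv_train.txt -> base_conv_train.num.
text2num("data/base_conv_train.txt")       # argument form is hypothetical
```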

Dataset

In our paper, we trained the model using the following three datasets.

  • Reddit: the conversational dataset (base_conv), which can be generated using this script.
  • Sherlock Holmes: one of the style datasets (bias_nonc), available here.
  • arXiv: another style corpus (bias_nonc), which can be obtained by following the instructions here.
  • A toy dataset is provided as an example following the format described above.