💬 DiDi: Dialog Diffusion model

Main | Code style: black | Checked with mypy

Work in progress.

Structure

  • src ‒ main source code: model and dataset implementations, plus code to train, test, or run inference with the model.
  • notebooks ‒ notebooks with experiments and visualizations.
  • scripts ‒ utility scripts, e.g. for printing dataset examples or evaluating existing models.
  • tests ‒ unit tests.

Requirements

Create a virtual environment with venv or conda and install the requirements:

pip install -r requirements.txt

To contribute, also install the development requirements:

pip install -r requirements-dev.txt

Data

Pretrain

We use the pushshift.io dataset of Reddit comments to pretrain our model. We collected all comments from 2019.

TODO: add preprocessing and filter steps

The dataset contains 237.212.662 dialogs in total: 237.162.662 in the train split and 25.000 each in the validation and test splits.
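The split sizes follow from holding out a fixed number of dialogs for validation and for test. The sketch below only reproduces that arithmetic. `split_sizes` is a hypothetical helper, not part of this repository, and it assumes that both held-out sets are subtracted from the total.

```python
from typing import Dict


def split_sizes(total: int, holdout: int) -> Dict[str, int]:
    """Return split sizes when `holdout` dialogs go to validation and to test each.

    Hypothetical helper for illustration; the actual split code may differ.
    """
    if total < 2 * holdout:
        raise ValueError("not enough dialogs for the requested hold-out size")
    return {"train": total - 2 * holdout, "validation": holdout, "test": holdout}


# Numbers taken from the paragraph above.
sizes = split_sizes(total=237_212_662, holdout=25_000)
print(sizes)
```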

CommonSense

CommonSense Conversation dataset from DiffuSeq. Token statistics were collected with the facebook/blenderbot-400M-distill tokenizer; see scripts.cc_tokens_stats.

Train

  • 3.382.137 samples
  • Context contains 81.772.641 tokens in range 2-84, average 24.178
  • Target contains 80.812.361 tokens in range 1-84, average 23.894

Valid

  • 2.047 samples
  • Context contains 49.424 tokens in range 3-53, average 24.133
  • Target contains 49.887 tokens in range 2-56, average 24.359

Test

  • 9.999 samples
  • Context contains 241.541 tokens in range 2-58, average 24.154
  • Target contains 240.374 tokens in range 2-61, average 24.037
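Per-split statistics of this kind (sample count, total tokens, length range, average) could be computed roughly as sketched below. This is not the code in scripts.cc_tokens_stats: `TokenStats` and `token_stats` are hypothetical names, and whitespace splitting stands in for the real tokenizer so that the example runs offline. With Hugging Face transformers installed, one could presumably pass `AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill").tokenize` instead. Whether the original script counts special tokens is not stated here.

```python
from typing import Callable, Iterable, List, NamedTuple


class TokenStats(NamedTuple):
    samples: int
    total: int
    shortest: int
    longest: int
    average: float


def token_stats(texts: Iterable[str], tokenize: Callable[[str], List]) -> TokenStats:
    """Collect token-count statistics over `texts` using the given tokenizer."""
    lengths = [len(tokenize(text)) for text in texts]
    if not lengths:
        raise ValueError("no texts given")
    total = sum(lengths)
    return TokenStats(len(lengths), total, min(lengths), max(lengths), total / len(lengths))


# Stand-in tokenizer (assumption): whitespace split instead of the BlenderBot tokenizer.
contexts = ["how are you ?", "fine , thanks", "hi"]
stats = token_stats(contexts, str.split)
print(
    f"{stats.samples} samples, {stats.total} tokens "
    f"in range {stats.shortest}-{stats.longest}, average {stats.average:.3f}"
)
```

Passing the tokenizer as a callable keeps the counting logic independent of any particular tokenizer library.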

Fine-tuning

We use the ConvAI2 Dataset containing dialogues between personas with different descriptive profiles. The dataset can be downloaded here.