
Reproducing the results with cross encoder on DSTC7/Ubuntu V2/Reddit #2974

Closed
luohongyin opened this issue Aug 12, 2020 · 16 comments

luohongyin commented Aug 12, 2020

Hi, I'm trying to reproduce the cross-encoder performance reported in the paper "Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring."

I trained the model on 8 16GB GPUs with the following settings, adapted from #2306 with the poly-encoder-specific settings removed:

parlai train_model \
    -t dstc7:DSTC7TeacherAugmentedSampled \
    --tensorboard-log True \
    --model-file model_file_dstc7/crossencoder \
    --init-model zoo:pretrained_transformers/cross_model_huge_reddit/model \
    --batchsize 4 \
    --model transformer/crossencoder \
    --warmup_updates 100 \
    --lr-scheduler-patience 0 \
    --lr-scheduler-decay 0.4 \
    -lr 5e-05 \
    --data-parallel True \
    --history-size 20 \
    --label-truncate 72 \
    --text-truncate 360 \
    --validation-patience 5 \
    --validation-every-n-epochs 0.5 \
    --validation-metric accuracy \
    --validation-metric-mode max \
    --save-after-valid True \
    --log_every_n_secs 20 \
    --candidates batch \
    --dict-tokenizer bpe \
    --dict-lower True \
    --optimizer adamax \
    --output-scaling 0.06 \
    --variant xlm \
    --reduction_type mean \
    --share-encoders False \
    --learn-positional-embeddings True \
    --n-layers 12 \
    --n-heads 12 \
    --ffn-size 3072 \
    --attention-dropout 0.1 \
    --relu-dropout 0.0 \
    --dropout 0.1 \
    --n-positions 1024 \
    --embedding-size 768 \
    --activation gelu \
    --embeddings-scale False \
    --n-segments 2 \
    --learn-embeddings True \
    --share-word-embeddings False \
    --dict-endtoken __start__ \
    --fp16 True

The model was trained for 110k steps but did not converge. The accuracy (hit@1) was 1.5%. Is there any example script for training a cross encoder on DSTC7? @klshuster

@klshuster
Contributor

I would first compare your hyperparams to those listed under the "Cross-encoder" section of the poly-encoder project page (https://parl.ai/projects/polyencoder/). What I can tell immediately is that you'll want to specify --candidates inline, as training with only 3 negatives (which is what --candidates batch --batchsize 4 gives you) will not yield adequate results.
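
To make the arithmetic concrete, here is a toy illustration (not actual ParlAI code) of why --candidates batch ties the number of negatives to the batch size:

# Toy illustration: with --candidates batch, each context in a batch of
# size B is scored against every label in that same batch, so each example
# sees only B - 1 negatives.
batch_labels = ["label_0", "label_1", "label_2", "label_3"]  # --batchsize 4

for i in range(len(batch_labels)):
    negatives = [lab for j, lab in enumerate(batch_labels) if j != i]
    print(f"example {i}: 1 positive, {len(negatives)} negatives")  # 3 each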

@klshuster klshuster self-assigned this Aug 12, 2020
@luohongyin
Author

Thanks for replying! I'm using the cross-encoder configuration from https://parl.ai/projects/polyencoder/, but the performance was not good (just 60.x%). Here are the training curves:

[training curve images]

My machine has 8 16GB GPUs, but I have to set the batch size to 2 if I use --candidates inline. Training runs for more than 300k steps.

Here's my training script

parlai train_model \
  --init-model zoo:pretrained_transformers/cross_model_huge_reddit/model \
  -t dstc7 \
  --model transformer/crossencoder --batchsize 2 --eval-batchsize 4 --tensorboard-log True \
  --warmup_updates 1000 --lr-scheduler-patience 0 --lr-scheduler-decay 0.4 \
  -lr 5e-05 --data-parallel True --history-size 20 --label-truncate 72 \
  --text-truncate 360 --num-epochs 12.0 --max_train_time 200000 --validation-every-n-epochs 0.5\
  --validation-max-exs 2500 --validation-metric accuracy --validation-metric-mode max --fp16 true\
  --save-after-valid True --log_every_n_secs 20 --candidates inline \
  --dict-tokenizer bpe --dict-lower True --optimizer adamax --output-scaling 0.06 \
  --variant xlm --reduction-type first --share-encoders False \
  --learn-positional-embeddings True --n-layers 12 --n-heads 12 --ffn-size 3072 \
  --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 --n-positions 1024 \
  --embedding-size 768 --activation gelu --embeddings-scale False --n-segments 2 \
  --learn-embeddings True --dict-endtoken __start__ \
  --model-file model_file_dstc7/cedstc7

@klshuster
Contributor

I have a few suggestions and also a few remarks:

  1. Make sure to compare your validation numbers to Table 10 in the poly-encoder paper (and note that validation results tend to be a bit lower than test results on DSTC7)
  2. For the cross-encoder, I recommend training on the -t dstc7:DSTC7TeacherAugmentedSampled task, as this fills in the negatives appropriately (for training with --candidates inline).
  3. Below is pretty much the exact model.opt file from our training run; feel free to compare with your hyperparams and adjust accordingly (I've tried to narrow it down to the relevant args):
{
  "datatype": "train:stream",
  "batchsize": 16,
  "model": "transformer/cross_encoder",
  "init_model": "zoo:pretrained_transformers/cross_model_huge_reddit/model",
  "eval_batchsize": 2,
  "num_epochs": 12,
  "max_train_time": 200000,
  "validation_every_n_epochs": 0.5,
  "validation_max_exs": 2500,
  "validation_patience": 10,
  "validation_metric": "accuracy",
  "validation_metric_mode": "max",
  "task": "dstc7:DSTC7TeacherAugmentedSampled",
  "log_every_n_secs": 20,
  "fp16": true,
  "optimizer": "adamax",
  "learningrate": 5e-05,
  "gradient_clip": 0.1,
  "momentum": 0,
  "nesterov": true,
  "nus": [
    0.7
  ],
  "betas": [
    0.9,
    0.999
  ],
  "lr_scheduler": "reduceonplateau",
  "lr_scheduler_patience": 0,
  "lr_scheduler_decay": 0.4,
  "warmup_updates": 1000,
  "warmup_rate": 0.0001,
  "text_truncate": 360,
  "label_truncate": 72,
  "history_size": 20,
  "candidates": "inline",
  "eval_candidates": "inline",
  "embedding_size": 768,
  "output_scaling": 0.13,
}

Perhaps try --output-scaling 0.13 as well to see if there is a difference? We generally found 0.06 to work better, but it could be worth a try. The above model achieved 63.6% accuracy on the DSTC7 dev set.

@luohongyin
Author

Thank you! Could you also let me know how to get the performance on the test split?

@klshuster
Contributor

You should see test results at the end of training; you can also always evaluate your model with parlai eval_model --model-file /path/to/saved/model -t dstc7 --datatype test

@luohongyin
Author

Thanks! It seems that 60%+ on the dev set is reasonable performance, so I'll temporarily close this issue and continue hyper-parameter tuning. Thank you for your kind assistance!

@luohongyin
Author

In the cross-encoder section of the poly-encoder paper, it says

"We thus limit its batch size to 16 and provide negatives random samples from the training set. For DSTC7 and Ubuntu V2, we choose 15 such negatives; For ConvAI2, the dataset provides 19 negatives."

It seems that the cross-encoder in the paper uses --candidates batch. However, I frequently encounter the "Ran out of memory, skipping batch" error while training the "batch" models on 8 32GB GPUs with the same batch size as in the paper (16).

Could you let me know

  • whether it's normal to hit (many) out-of-memory errors, and
  • whether there is any performance gap between using batch negative samples and inline negative samples?

Thanks!

@klshuster
Contributor

That sentence is actually just saying that we use a batch size of 16 regardless of dataset; we still work with inline candidates. There should not be any performance gap between batch negatives and inline negatives.

Using --candidates inline lets you specify a smaller batch size if you continue to run out of memory, while still maintaining the same number of negative candidates.
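
As a toy illustration (again not actual ParlAI code), inline candidates decouple the negative count from the batch size:

# Toy illustration: with --candidates inline, each example carries its own
# candidate list, so the number of negatives (here 15) stays fixed no
# matter how small the batch is.
example = {
    "text": "dialogue context goes here",
    "labels": ["the true response"],
    "label_candidates": ["the true response"]
    + [f"sampled negative {i}" for i in range(15)],
}

# Even with --batchsize 1, the cross-encoder still scores all 16 candidates,
# whereas --candidates batch with --batchsize 1 would leave zero negatives.
print(len(example["label_candidates"]) - 1, "negatives for this example")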

@luohongyin
Author

Sounds great, thank you!

@luohongyin luohongyin changed the title Reproducing the results with cross encoder on DSTC7 Reproducing the results with cross encoder on DSTC7 and Ubuntu V2 Aug 18, 2020
@luohongyin luohongyin reopened this Aug 18, 2020
@luohongyin
Author

Sorry for the frequent questions - I have successfully reproduced the experiments on DSTC7 and plan to move on to Ubuntu V2. However, the Ubuntu V2 training data is a .csv file with the header Context,Utterance,Label, and --candidates inline does not work.

I wonder if the torch_ranker_agent still works for Ubuntu V2. Should I set --candidates batch?

It would also be very helpful if you could share your Ubuntu V2 training settings, if you can still find them. Many thanks!

@klshuster
Contributor

The training settings should be similar if not the same as the ones listed above for dstc7.

For our experiments we wrote an augmented teacher that aggregated all the labels in the training set and randomly sampled 15 negatives to put in the label_candidates field (in addition to the true label) for each example, in order to train with --candidates inline. A rough sketch of that idea is below.
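
This is hypothetical code, not the exact teacher; it assumes the training examples are dicts with ParlAI-style "labels" fields and that there are more than num_negatives distinct labels:

import random

def add_sampled_negatives(examples, num_negatives=15, seed=0):
    """Hypothetical helper: pool all training labels, then give each example
    its true label plus num_negatives labels sampled from the rest of the
    training set in a label_candidates field."""
    rng = random.Random(seed)
    all_labels = [ex["labels"][0] for ex in examples]
    for ex in examples:
        true_label = ex["labels"][0]
        negatives = []
        while len(negatives) < num_negatives:
            cand = rng.choice(all_labels)
            if cand != true_label:
                negatives.append(cand)
        candidates = [true_label] + negatives
        rng.shuffle(candidates)
        ex["label_candidates"] = candidates
    return examples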

@luohongyin
Author

Thank you!

@luohongyin luohongyin changed the title Reproducing the results with cross encoder on DSTC7 and Ubuntu V2 Reproducing the results with cross encoder on DSTC7/Ubuntu V2/Reddit Aug 27, 2020
@luohongyin luohongyin reopened this Aug 27, 2020
@luohongyin
Author

Thanks for your help on DSTC7 and Ubuntu! Could you let me know whether it's possible to train my own cross-encoder model on the "Reddit huge" data? If so, what are the best method and settings for doing that?

@klshuster
Contributor

We don't distribute the Reddit data, but you can download it from pushshift.io and process it yourself following the instructions in https://arxiv.org/abs/1809.01984, then train as specified in the poly-encoder paper (and also in the linked paper).

@github-actions

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

@github-actions github-actions bot added the stale label Sep 27, 2020
@klshuster
Contributor

closing for now, please reopen if there are further issues
