
Reproducing the results with cross encoder on DSTC7/Ubuntu V2/Reddit #2974

Closed
luohongyin opened this issue Aug 12, 2020 · 16 comments

luohongyin commented Aug 12, 2020

Hi, I'm trying to reproduce the cross-encoder performance reported in the paper "Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring."

I trained the model on 8 16GB GPUs with the following settings, adapted from #2306 with the poly-encoder-specific settings removed:

parlai train_model \
    -t dstc7:DSTC7TeacherAugmentedSampled \
    --tensorboard-log True \
    --model-file model_file_dstc7/crossencoder \
    --init-model zoo:pretrained_transformers/cross_model_huge_reddit/model \
    --batchsize 4 \
    --model transformer/crossencoder \
    --warmup_updates 100 \
    --lr-scheduler-patience 0 \
    --lr-scheduler-decay 0.4 \
    -lr 5e-05 \
    --data-parallel True \
    --history-size 20 \
    --label-truncate 72 \
    --text-truncate 360 \
    --validation-patience 5 \
    --validation-every-n-epochs 0.5 \
    --validation-metric accuracy \
    --validation-metric-mode max \
    --save-after-valid True \
    --log_every_n_secs 20 \
    --candidates batch \
    --dict-tokenizer bpe \
    --dict-lower True \
    --optimizer adamax \
    --output-scaling 0.06 \
    --variant xlm \
    --reduction_type mean \
    --share-encoders False \
    --learn-positional-embeddings True \
    --n-layers 12 \
    --n-heads 12 \
    --ffn-size 3072 \
    --attention-dropout 0.1 \
    --relu-dropout 0.0 \
    --dropout 0.1 \
    --n-positions 1024 \
    --embedding-size 768 \
    --activation gelu \
    --embeddings-scale False \
    --n-segments 2 \
    --learn-embeddings True \
    --share-word-embeddings False \
    --dict-endtoken __start__ \
    --fp16 True

The model was trained for 110k steps but did not converge. The accuracy (hit@1) was 1.5%. Is there any example script for training a cross encoder on DSTC7? @klshuster

@klshuster
Contributor

I would first compare your hyperparams to those listed under the "Cross-encoder" section of the poly-encoder project page (https://parl.ai/projects/polyencoder/). What I can tell immediately is that you'll want to specify --candidates inline, as training with only 3 negatives (which is what --candidates batch --batchsize 4 gives you) will not yield adequate results.
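
To make the arithmetic concrete, here is a toy illustration (not actual ParlAI code) of why --candidates batch ties the number of negatives to the batch size:

# Toy illustration: with --candidates batch, each context in a batch of
# size B is scored against every label in that same batch, so each example
# sees only B - 1 negatives.
batch_labels = ["label_0", "label_1", "label_2", "label_3"]  # --batchsize 4

for i in range(len(batch_labels)):
    negatives = [lab for j, lab in enumerate(batch_labels) if j != i]
    print(f"example {i}: 1 positive, {len(negatives)} negatives")  # 3 each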

@klshuster klshuster self-assigned this Aug 12, 2020
@luohongyin
Author

Thanks for replying! I'm using the cross-encoder configuration from https://parl.ai/projects/polyencoder/, but the performance was not good (just 60.x%). Here are the training curves:

[training curve images]

My machine has 8 16GB GPUs, but I have to set the batch size to 2 if I use --candidates inline. Training runs for more than 300k steps.

Here's my training script

parlai train_model \
  --init-model zoo:pretrained_transformers/cross_model_huge_reddit/model \
  -t dstc7 \
  --model transformer/crossencoder --batchsize 2 --eval-batchsize 4 --tensorboard-log True \
  --warmup_updates 1000 --lr-scheduler-patience 0 --lr-scheduler-decay 0.4 \
  -lr 5e-05 --data-parallel True --history-size 20 --label-truncate 72 \
  --text-truncate 360 --num-epochs 12.0 --max_train_time 200000 --validation-every-n-epochs 0.5\
  --validation-max-exs 2500 --validation-metric accuracy --validation-metric-mode max --fp16 true\
  --save-after-valid True --log_every_n_secs 20 --candidates inline \
  --dict-tokenizer bpe --dict-lower True --optimizer adamax --output-scaling 0.06 \
  --variant xlm --reduction-type first --share-encoders False \
  --learn-positional-embeddings True --n-layers 12 --n-heads 12 --ffn-size 3072 \
  --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 --n-positions 1024 \
  --embedding-size 768 --activation gelu --embeddings-scale False --n-segments 2 \
  --learn-embeddings True --dict-endtoken __start__ \
  --model-file model_file_dstc7/cedstc7

@klshuster
Contributor

I have a few suggestions and also a few remarks:

  1. Make sure to compare your validation numbers to Table 10 in the poly-encoder paper (and note that validation results tend to be a bit lower than test results on DSTC7)
  2. For the cross-encoder, I recommend training on the -t dstc7:DSTC7TeacherAugmentedSampled task, as this fills in the negatives appropriately (for training with --candidates inline).
  3. Below is pretty much the exact model.opt file from our training run; feel free to compare with your hyperparams and adjust accordingly (I've tried to narrow it down to the relevant args):
{
  "datatype": "train:stream",
  "batchsize": 16,
  "model": "transformer/cross_encoder",
  "init_model": "zoo:pretrained_transformers/cross_model_huge_reddit/model",
  "eval_batchsize": 2,
  "num_epochs": 12,
  "max_train_time": 200000,
  "validation_every_n_epochs": 0.5,
  "validation_max_exs": 2500,
  "validation_patience": 10,
  "validation_metric": "accuracy",
  "validation_metric_mode": "max",
  "task": "dstc7:DSTC7TeacherAugmentedSampled",
  "log_every_n_secs": 20,
  "fp16": true,
  "optimizer": "adamax",
  "learningrate": 5e-05,
  "gradient_clip": 0.1,
  "momentum": 0,
  "nesterov": true,
  "nus": [
    0.7
  ],
  "betas": [
    0.9,
    0.999
  ],
  "lr_scheduler": "reduceonplateau",
  "lr_scheduler_patience": 0,
  "lr_scheduler_decay": 0.4,
  "warmup_updates": 1000,
  "warmup_rate": 0.0001,
  "text_truncate": 360,
  "label_truncate": 72,
  "history_size": 20,
  "candidates": "inline",
  "eval_candidates": "inline",
  "embedding_size": 768,
  "output_scaling": 0.13,
}

Perhaps try --output-scaling 0.13 as well to see if there is a difference? We generally found 0.06 to work better, but it could be worth a try. The above model achieved 63.6% accuracy on the DSTC7 dev set.

@luohongyin
Author

Thank you! Could you also let me know how to get the performance on the test split?

@klshuster
Contributor

You should see test results at the end of training; you can also always evaluate your model with parlai eval_model --model-file /path/to/saved/model -t dstc7 --datatype test

@luohongyin
Author

Thanks! It seems that 60%+ on the dev set is reasonable performance, so I'll temporarily close this issue and continue hyper-parameter tuning. Thank you for your kind assistance!

@luohongyin
Author

In the cross-encoder section of the poly-encoder paper, it says

"We thus limit its batch size to 16 and provide negatives random samples from the training set. For DSTC7 and Ubuntu V2, we choose 15 such negatives; For ConvAI2, the dataset provides 19 negatives."

It seems that the cross-encoder in the paper uses --candidates batch. However, I frequently encounter the "Ran out of memory, skipping batch" error while training the "batch" models on 8 32GB GPUs with the same batch size as in the paper (16).

Could you let me know

  • whether it's normal to hit (many) out-of-memory errors, and
  • whether there is any performance gap between using batch negative samples and inline negative samples?

Thanks!

@klshuster
Contributor

That sentence is actually just saying that we use a batch size of 16 regardless of dataset; we still work with inline candidates. There should not be any performance gap between batch negatives and inline negatives.

Using --candidates inline lets you specify a smaller batch size if you continue to run out of memory, while still maintaining the same number of negative candidates.
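
As a toy illustration (again not actual ParlAI code), inline candidates decouple the negative count from the batch size:

# Toy illustration: with --candidates inline, each example carries its own
# candidate list, so the number of negatives (here 15) stays fixed no
# matter how small the batch is.
example = {
    "text": "dialogue context goes here",
    "labels": ["the true response"],
    "label_candidates": ["the true response"]
    + [f"sampled negative {i}" for i in range(15)],
}

# Even with --batchsize 1, the cross-encoder still scores all 16 candidates,
# whereas --candidates batch with --batchsize 1 would leave zero negatives.
print(len(example["label_candidates"]) - 1, "negatives for this example")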

@luohongyin
Author

Sounds great, thank you!

@luohongyin luohongyin changed the title Reproducing the results with cross encoder on DSTC7 Reproducing the results with cross encoder on DSTC7 and Ubuntu V2 Aug 18, 2020
@luohongyin luohongyin reopened this Aug 18, 2020
@luohongyin
Author

Sorry for the frequent questions - I have successfully reproduced the experiments on DSTC7 and plan to move on to Ubuntu V2. However, the Ubuntu V2 training data is a .csv file with the header Context,Utterance,Label, and --candidates inline does not work.

I wonder if the torch_ranker_agent still works for Ubuntu V2. Should I set --candidates batch?

It would also be very helpful if you could share your Ubuntu V2 training settings, if you can still find them. Many thanks!

@klshuster
Contributor

The training settings should be similar if not the same as the ones listed above for dstc7.

For our experiments we wrote an augmented teacher that aggregated all the labels in the training set and randomly sampled 15 negatives to put in the label_candidates field (in addition to the true label) for each example, in order to train with --candidates inline. A rough sketch of that idea is below.
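
This is hypothetical code, not the exact teacher; it assumes the training examples are dicts with ParlAI-style "labels" fields and that there are more than num_negatives distinct labels:

import random

def add_sampled_negatives(examples, num_negatives=15, seed=0):
    """Hypothetical helper: pool all training labels, then give each example
    its true label plus num_negatives labels sampled from the rest of the
    training set in a label_candidates field."""
    rng = random.Random(seed)
    all_labels = [ex["labels"][0] for ex in examples]
    for ex in examples:
        true_label = ex["labels"][0]
        negatives = []
        while len(negatives) < num_negatives:
            cand = rng.choice(all_labels)
            if cand != true_label:
                negatives.append(cand)
        candidates = [true_label] + negatives
        rng.shuffle(candidates)
        ex["label_candidates"] = candidates
    return examples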

@luohongyin
Author

Thank you!

@luohongyin luohongyin changed the title Reproducing the results with cross encoder on DSTC7 and Ubuntu V2 Reproducing the results with cross encoder on DSTC7/Ubuntu V2/Reddit Aug 27, 2020
@luohongyin luohongyin reopened this Aug 27, 2020
@luohongyin
Author

Thanks for your help on DSTC7 and Ubuntu! Could you let me know whether it's possible to train my own cross-encoder model on the "Reddit huge" data? If so, what are the best method and settings for doing that?

@klshuster
Contributor

We don't distribute the Reddit data, but you can download it from pushshift.io and process it yourself following the instructions in https://arxiv.org/abs/1809.01984, then train as specified in the poly-encoder paper (and also in the linked paper).

@github-actions

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

@github-actions github-actions bot added the stale label Sep 27, 2020
@klshuster
Contributor

closing for now, please reopen if there are further issues
