Unable to obtain the results written in the paper #6

Open

yutxie opened this issue Dec 20, 2018 · 10 comments

yutxie commented Dec 20, 2018

I've tried your released baseline code, but there are some differences between the results I get on the validation set and those reported in your paper.

| experiment | CoLA (mcc) | SST-2 | MRPC (acc/f1) | QQP (acc/f1) | STS-B (pear/spear) | MNLI (m/mm) | QNLI | RTE | WNLI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| your paper | 24.0 | 85.8 | 71.9/82.1 | 80.2/59.1 | 68.8/67.0 | 65.8/66.0 | 71.1 | 46.8 | 63.7 |
| my result | 12.5 | 87.0 | 74.0/82.9 | 79.4/73.5 | 72.6/72.6 | 59.9/60.5 | 58.4 | 57.4 | 14.1 |

Both runs employ the basic BiLSTM model and follow the MTL setting.
You can see there is a huge gap for CoLA (24.0 vs. 12.5) and WNLI (63.7 vs. 14.1).

Here are my hyperparameter settings; could you please help me check whether they match yours? This is based on run_staff.sh:

GPUID=0
train_tasks='all' # 'all', 'none'
original_model_code=1 # 1 to use the original models.py
single_encoder=1 # ignored if original_model_code=1

SHOULD_TRAIN=1
SHOULD_TEST=0
LOAD_MODEL=0
LOAD_TASKS=1
LOAD_PREPROC=1
load_epoch=-1

SCRATCH_PREFIX='.'
EXP_NAME="preprocess"
RUN_NAME="results/original"
SEED=19
no_tqdm=0

eval_tasks='none'
CLASSIFIER=mlp
d_hid_cls=512
max_seq_len=40
VOCAB_SIZE=30000
WORD_EMBS_FILE="${SCRATCH_PREFIX}/embeddings/glove.840B.300d.txt"

d_word=300
d_hid=1500
glove=1
ELMO=0
deep_elmo=0
elmo_no_glove=0
COVE=0

PAIR_ENC="simple"
N_LAYERS_ENC=2
n_layers_highway=0

OPTIMIZER="adam"
LR=1e-3
min_lr=1e-5
dropout=.2
LR_DECAY=.2
patience=5
task_patience=0
train_words=0
WEIGHT_DECAY=0.0
SCHED_THRESH=0.0
BATCH_SIZE=128
BPP_METHOD="percent_tr"
BPP_BASE=10
VAL_INTERVAL=10000 # also the epoch_size, default=10
MAX_VALS=100
TASK_ORDERING="random"
weighting_method="uniform"
scaling_method='none'
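
# The getopts loop below lets any of the defaults above be overridden from the command line.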

while getopts 'ikmn:r:S:s:tvh:l:L:o:T:E:O:b:H:p:edcgP:qB:V:M:D:C:X:GI:N:y:K:W:' flag; do
    case "${flag}" in
        P) SCRATCH_PREFIX="${OPTARG}" ;;
        n) EXP_NAME="${OPTARG}" ;;
        r) RUN_NAME="${OPTARG}" ;;
        S) SEED="${OPTARG}" ;;
        q) no_tqdm=1 ;;
        t) SHOULD_TRAIN=0 ;;
        k) LOAD_TASKS=0 ;;
        m) LOAD_MODEL=1 ;;
        i) LOAD_PREPROC=0 ;;
        M) BPP_METHOD="${OPTARG}" ;; 
        B) BPP_BASE="${OPTARG}" ;;
        V) VAL_INTERVAL="${OPTARG}" ;;
        X) MAX_VALS="${OPTARG}" ;;
        T) train_tasks="${OPTARG}" ;;
        #E) eval_tasks="${OPTARG}" ;;
        O) TASK_ORDERING="${OPTARG}" ;;
        H) n_layers_highway="${OPTARG}" ;;
        l) LR="${OPTARG}" ;;
        #s) min_lr="${OPTARG}" ;;
        L) N_LAYERS_ENC="${OPTARG}" ;;
        o) OPTIMIZER="${OPTARG}" ;;
        h) d_hid="${OPTARG}" ;;
        b) BATCH_SIZE="${OPTARG}" ;;
        E) PAIR_ENC="${OPTARG}" ;;
        G) glove=0 ;;
        e) ELMO=1 ;;
        d) deep_elmo=1 ;;
        g) elmo_no_glove=1 ;;
        c) COVE=1 ;;
        D) dropout="${OPTARG}" ;;
        C) CLASSIFIER="${OPTARG}" ;;
        I) GPUID="${OPTARG}" ;;
        N) load_epoch="${OPTARG}" ;;
        y) LR_DECAY="${OPTARG}" ;;
        K) task_patience="${OPTARG}" ;;
        p) patience="${OPTARG}" ;;
        W) weighting_method="${OPTARG}" ;;
        s) scaling_method="${OPTARG}" ;;
    esac
done
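
# Build output directories, assemble the full src/main.py command, and run it.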

LOG_PATH="${SCRATCH_PREFIX}/${RUN_NAME}/log.log"
EXP_DIR="${SCRATCH_PREFIX}/${EXP_NAME}/"
RUN_DIR="${SCRATCH_PREFIX}/${RUN_NAME}/"
mkdir -p ${EXP_DIR}
mkdir -p ${RUN_DIR}

ALLEN_CMD="python src/main.py --cuda ${GPUID} --random_seed ${SEED} --no_tqdm ${no_tqdm} --log_file ${LOG_PATH} --exp_dir ${EXP_DIR} --run_dir ${RUN_DIR} --train_tasks ${train_tasks} --eval_tasks ${eval_tasks} --classifier ${CLASSIFIER} --classifier_hid_dim ${d_hid_cls} --max_seq_len ${max_seq_len} --max_word_v_size ${VOCAB_SIZE} --word_embs_file ${WORD_EMBS_FILE} --train_words ${train_words} --glove ${glove} --elmo ${ELMO} --deep_elmo ${deep_elmo} --elmo_no_glove ${elmo_no_glove} --cove ${COVE} --d_word ${d_word} --d_hid ${d_hid} --n_layers_enc ${N_LAYERS_ENC} --pair_enc ${PAIR_ENC} --n_layers_highway ${n_layers_highway} --batch_size ${BATCH_SIZE} --bpp_method ${BPP_METHOD} --bpp_base ${BPP_BASE} --optimizer ${OPTIMIZER} --lr ${LR} --min_lr ${min_lr} --lr_decay_factor ${LR_DECAY} --task_patience ${task_patience} --patience ${patience} --weight_decay ${WEIGHT_DECAY} --dropout ${dropout} --val_interval ${VAL_INTERVAL} --max_vals ${MAX_VALS} --task_ordering ${TASK_ORDERING} --weighting_method ${weighting_method} --scaling_method ${scaling_method} --scheduler_threshold ${SCHED_THRESH} --load_model ${LOAD_MODEL} --load_tasks ${LOAD_TASKS} --load_preproc ${LOAD_PREPROC} --should_train ${SHOULD_TRAIN} --should_test ${SHOULD_TEST} --load_epoch ${load_epoch}"
eval ${ALLEN_CMD}

BTW, this is how I test on the validation set. The code is based on eval_test.py and main.py:

import os
import json
import ipdb as pdb
import numpy as np

from sklearn.metrics import matthews_corrcoef, f1_score
from scipy.stats import pearsonr, spearmanr
from allennlp.data.dataset import Batch

def evaluate_val(tasks, val_preds):
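    # compare stored validation predictions with gold labels, task by task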
    for eval_task, task_preds in val_preds.items(): # write predictions for each task
        #if 'mnli' not in eval_task:
        #    continue
        task = [task for task in tasks if task.name == eval_task][0]
        preds = task_preds[0]
        val_data = Batch(task.val_data).as_tensor_dict()
        golds = val_data['label']
        assert len(preds) == len(golds)
        if 'mnli' in eval_task:
            # matched
            evaluate('mnli-m', golds[:9815], preds[:9815])
            # mismatched
            evaluate('mnli-mm', golds[9815:9815+9832], preds[9815:9815+9832])
        else:
            metrics = ['acc']
            if 'cola' in eval_task:
                metrics = ['matthews']
            if 'mrpc' in eval_task or 'qqp' in eval_task:
                metrics = ['acc', 'f1']
            if 'sts' in eval_task:
                golds = golds * 5.
                metrics = ['corr']
            evaluate(eval_task, golds, preds, metrics)


def evaluate(task_name, golds, preds, metrics=['acc']):
    assert len(golds) == len(preds)
    print('***************************** %s:' % task_name)
    if 'acc' in metrics:
        acc = sum([1 for gold, pred in zip(golds, preds) if gold == pred]) / float(len(golds))
        print("acc: %.3f" % acc)
    if 'f1' in metrics:
        f1 = f1_score(golds, preds)
        print("f1: %.3f" % f1)
    if 'matthews' in metrics:
        mcc = matthews_corrcoef(golds, preds)
        print("mcc: %.3f" % mcc)
    if 'corr' in metrics:
        golds = np.asarray(golds).reshape(-1)
        preds = np.asarray(preds).reshape(-1)
        corr = pearsonr(golds, preds)[0]
        print("pearson r: %.3f" % corr)
        corr = spearmanr(golds, preds)[0]
        print("spearman r: %.3f" % corr)

Thanks!

@sleepinyourhat
Contributor

Hi!

The results in the paper are test set results (as it says in the caption), and several datasets have non-trivial differences between the dev and test data, so it's possible that you've already reproduced our results exactly.

In any case, though, I'd urge you to use the newer jiant codebase. It's much better documented, and gets strictly better results than the baselines here. We don't have public dev set numbers from that codebase yet, but if you post an issue there, we should be able to assemble some.

https://github.com/jsalt18-sentence-repl/jiant

If you do need to use this codebase, reply here and @W4ngatang should be able to share the exact hyperparameters we used.

@yutxie
Author

yutxie commented Dec 21, 2018

Thanks for your enthusiastic reply!

I've submitted my predictions to the GLUE platform, but there are still some gaps on CoLA, QNLI, and WNLI.

| | average | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | WNLI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BiLSTM baseline | 63.5 | 24.0 | 85.8 | 82.1/71.9 | 68.8/67.0 | 59.1/80.2 | 65.8/66.0 | 71.1 | 46.8 | 63.7 |
| my results | 60.4 | 13.9 | 84.6 | 81.6/73.0 | 68.8/66.7 | 57.2/79.7 | 61.3/61.8 | 63.0 | 54.2 | 52.7 |

So it would be very kind of you to share the hyperparameters that produce the baselines in this codebase.

Besides, I'm willing to switch to jiant, but I'm not sure whether I can reproduce the GLUE baselines with it. Can I obtain the results on the leaderboard by running the final_glue_runs.sh script without modification?

Thanks again!

@sleepinyourhat
Contributor

@W4ngatang - Could you take this one?

If you need to exactly match our baselines, jiant won't do that. This paper publishes numbers from the final_glue_runs script, though: https://openreview.net/pdf?id=Bkl87h09FX

Sam

@W4ngatang
Collaborator

Hey @xxxxxyt, I've added the exact scripts that I'm running here. Could you try running those?

@Bogerchen

Hey, after fixing lots of issues, I tried running the code. However, I still get the following error:

Traceback (most recent call last):
  File "src/main.py", line 280, in <module>
    sys.exit(main(sys.argv[1:]))
  File "src/main.py", line 186, in main
    trainer = MultiTaskTrainer.from_params(model, args.run_dir + '/%s/' % task.name,
NameError: name 'MultiTaskTrainer' is not defined

I find that 'MultiTaskTrainer' is not defined anywhere in the repository. Could you please share the script that defines 'MultiTaskTrainer'? Many thanks! @thxyutong @sleepinyourhat

@cyente

cyente commented Dec 2, 2019

@Bogerchen hey bro, have you fixed the problem?

@smolPixel

Running into the same MultiTaskTrainer issue. Did someone find a fix? Also, @sleepinyourhat, regarding jiant: I tried using it but found no option for running non-Transformer architectures (I want to rerun the LSTM described in the GLUE paper). Maybe I missed something? I would appreciate you pointing me to the right way to do it :)

@sleepinyourhat
Contributor

The reference to jiant above was to v1.3: https://github.com/nyu-mll/jiant-v1-legacy

The new v2.0 is mostly a wrapper around Transformers, so it drops LSTM support. Start with v1.3.

@sleepinyourhat
Contributor

You'll have a much easier time with jiant than with this repo, but if you need an exact reproduction for some reason, ping w4ngatang again.

@myzwisc

myzwisc commented Oct 21, 2021

Can someone share the MultiTaskTrainer script? I really need this script to exactly reproduce the original GLUE benchmark results. Thanks.
