Unable to obtain the results written in the paper #6

Open

yutxie opened this issue Dec 20, 2018 · 10 comments

yutxie commented Dec 20, 2018

I've tried your released baseline code, but there are some differences between the results I get on the validation set and those reported in your paper.

| experiment | CoLA (mcc) | SST-2 | MRPC (acc/f1) | QQP (acc/f1) | STS-B (pear/spear) | MNLI (m/mm) | QNLI | RTE | WNLI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| your paper | 24.0 | 85.8 | 71.9/82.1 | 80.2/59.1 | 68.8/67.0 | 65.8/66.0 | 71.1 | 46.8 | 63.7 |
| my result | 12.5 | 87.0 | 74.0/82.9 | 79.4/73.5 | 72.6/72.6 | 59.9/60.5 | 58.4 | 57.4 | 14.1 |

Both runs employ the basic BiLSTM model and follow the MTL setting.
You can see there is a huge gap for CoLA (24.0 vs. 12.5) and WNLI (63.7 vs. 14.1).

Here are my hyperparameter settings; could you please help me check whether they match yours? This is based on run_staff.sh:

GPUID=0
train_tasks='all' # 'all', 'none'
original_model_code=1 # 1 to use the original models.py
single_encoder=1 # ignored if original_model_code=1

SHOULD_TRAIN=1
SHOULD_TEST=0
LOAD_MODEL=0
LOAD_TASKS=1
LOAD_PREPROC=1
load_epoch=-1

SCRATCH_PREFIX='.'
EXP_NAME="preprocess"
RUN_NAME="results/original"
SEED=19
no_tqdm=0

eval_tasks='none'
CLASSIFIER=mlp
d_hid_cls=512
max_seq_len=40
VOCAB_SIZE=30000
WORD_EMBS_FILE="${SCRATCH_PREFIX}/embeddings/glove.840B.300d.txt"

d_word=300
d_hid=1500
glove=1
ELMO=0
deep_elmo=0
elmo_no_glove=0
COVE=0

PAIR_ENC="simple"
N_LAYERS_ENC=2
n_layers_highway=0

OPTIMIZER="adam"
LR=1e-3
min_lr=1e-5
dropout=.2
LR_DECAY=.2
patience=5
task_patience=0
train_words=0
WEIGHT_DECAY=0.0
SCHED_THRESH=0.0
BATCH_SIZE=128
BPP_METHOD="percent_tr"
BPP_BASE=10
VAL_INTERVAL=10000 # also the epoch_size, default=10
MAX_VALS=100
TASK_ORDERING="random"
weighting_method="uniform"
scaling_method='none'
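
# The getopts loop below lets any of the defaults above be overridden from the command line.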

while getopts 'ikmn:r:S:s:tvh:l:L:o:T:E:O:b:H:p:edcgP:qB:V:M:D:C:X:GI:N:y:K:W:' flag; do
    case "${flag}" in
        P) SCRATCH_PREFIX="${OPTARG}" ;;
        n) EXP_NAME="${OPTARG}" ;;
        r) RUN_NAME="${OPTARG}" ;;
        S) SEED="${OPTARG}" ;;
        q) no_tqdm=1 ;;
        t) SHOULD_TRAIN=0 ;;
        k) LOAD_TASKS=0 ;;
        m) LOAD_MODEL=1 ;;
        i) LOAD_PREPROC=0 ;;
        M) BPP_METHOD="${OPTARG}" ;; 
        B) BPP_BASE="${OPTARG}" ;;
        V) VAL_INTERVAL="${OPTARG}" ;;
        X) MAX_VALS="${OPTARG}" ;;
        T) train_tasks="${OPTARG}" ;;
        #E) eval_tasks="${OPTARG}" ;;
        O) TASK_ORDERING="${OPTARG}" ;;
        H) n_layers_highway="${OPTARG}" ;;
        l) LR="${OPTARG}" ;;
        #s) min_lr="${OPTARG}" ;;
        L) N_LAYERS_ENC="${OPTARG}" ;;
        o) OPTIMIZER="${OPTARG}" ;;
        h) d_hid="${OPTARG}" ;;
        b) BATCH_SIZE="${OPTARG}" ;;
        E) PAIR_ENC="${OPTARG}" ;;
        G) glove=0 ;;
        e) ELMO=1 ;;
        d) deep_elmo=1 ;;
        g) elmo_no_glove=1 ;;
        c) COVE=1 ;;
        D) dropout="${OPTARG}" ;;
        C) CLASSIFIER="${OPTARG}" ;;
        I) GPUID="${OPTARG}" ;;
        N) load_epoch="${OPTARG}" ;;
        y) LR_DECAY="${OPTARG}" ;;
        K) task_patience="${OPTARG}" ;;
        p) patience="${OPTARG}" ;;
        W) weighting_method="${OPTARG}" ;;
        s) scaling_method="${OPTARG}" ;;
    esac
done
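
# Build output directories, assemble the full src/main.py command, and run it.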

LOG_PATH="${SCRATCH_PREFIX}/${RUN_NAME}/log.log"
EXP_DIR="${SCRATCH_PREFIX}/${EXP_NAME}/"
RUN_DIR="${SCRATCH_PREFIX}/${RUN_NAME}/"
mkdir -p ${EXP_DIR}
mkdir -p ${RUN_DIR}

ALLEN_CMD="python src/main.py --cuda ${GPUID} --random_seed ${SEED} --no_tqdm ${no_tqdm} --log_file ${LOG_PATH} --exp_dir ${EXP_DIR} --run_dir ${RUN_DIR} --train_tasks ${train_tasks} --eval_tasks ${eval_tasks} --classifier ${CLASSIFIER} --classifier_hid_dim ${d_hid_cls} --max_seq_len ${max_seq_len} --max_word_v_size ${VOCAB_SIZE} --word_embs_file ${WORD_EMBS_FILE} --train_words ${train_words} --glove ${glove} --elmo ${ELMO} --deep_elmo ${deep_elmo} --elmo_no_glove ${elmo_no_glove} --cove ${COVE} --d_word ${d_word} --d_hid ${d_hid} --n_layers_enc ${N_LAYERS_ENC} --pair_enc ${PAIR_ENC} --n_layers_highway ${n_layers_highway} --batch_size ${BATCH_SIZE} --bpp_method ${BPP_METHOD} --bpp_base ${BPP_BASE} --optimizer ${OPTIMIZER} --lr ${LR} --min_lr ${min_lr} --lr_decay_factor ${LR_DECAY} --task_patience ${task_patience} --patience ${patience} --weight_decay ${WEIGHT_DECAY} --dropout ${dropout} --val_interval ${VAL_INTERVAL} --max_vals ${MAX_VALS} --task_ordering ${TASK_ORDERING} --weighting_method ${weighting_method} --scaling_method ${scaling_method} --scheduler_threshold ${SCHED_THRESH} --load_model ${LOAD_MODEL} --load_tasks ${LOAD_TASKS} --load_preproc ${LOAD_PREPROC} --should_train ${SHOULD_TRAIN} --should_test ${SHOULD_TEST} --load_epoch ${load_epoch}"
eval ${ALLEN_CMD}

BTW, this is how I test on the validation set. The code is based on eval_test.py and main.py:

import os
import json
import ipdb as pdb
import numpy as np

from sklearn.metrics import matthews_corrcoef, f1_score
from scipy.stats import pearsonr, spearmanr
from allennlp.data.dataset import Batch

def evaluate_val(tasks, val_preds):
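    # compare stored validation predictions with gold labels, task by task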
    for eval_task, task_preds in val_preds.items(): # write predictions for each task
        #if 'mnli' not in eval_task:
        #    continue
        task = [task for task in tasks if task.name == eval_task][0]
        preds = task_preds[0]
        val_data = Batch(task.val_data).as_tensor_dict()
        golds = val_data['label']
        assert len(preds) == len(golds)
        if 'mnli' in eval_task:
            # matched
            evaluate('mnli-m', golds[:9815], preds[:9815])
            # mismatched
            evaluate('mnli-mm', golds[9815:9815+9832], preds[9815:9815+9832])
        else:
            metrics = ['acc']
            if 'cola' in eval_task:
                metrics = ['matthews']
            if 'mrpc' in eval_task or 'qqp' in eval_task:
                metrics = ['acc', 'f1']
            if 'sts' in eval_task:
                golds = golds * 5.
                metrics = ['corr']
            evaluate(eval_task, golds, preds, metrics)


def evaluate(task_name, golds, preds, metrics=['acc']):
    assert len(golds) == len(preds)
    print('***************************** %s:' % task_name)
    if 'acc' in metrics:
        acc = sum([1 for gold, pred in zip(golds, preds) if gold == pred]) / float(len(golds))
        print("acc: %.3f" % acc)
    if 'f1' in metrics:
        f1 = f1_score(golds, preds)
        print("f1: %.3f" % f1)
    if 'matthews' in metrics:
        mcc = matthews_corrcoef(golds, preds)
        print("mcc: %.3f" % mcc)
    if 'corr' in metrics:
        golds = np.asarray(golds).reshape(-1)
        preds = np.asarray(preds).reshape(-1)
        corr = pearsonr(golds, preds)[0]
        print("pearson r: %.3f" % corr)
        corr = spearmanr(golds, preds)[0]
        print("spearman r: %.3f" % corr)

Thanks!

@sleepinyourhat
Contributor

Hi!

The results in the paper are test set results (as it says in the caption), and several datasets have non-trivial differences between the dev and test data, so it's possible that you've already reproduced our results exactly.

In any case, though, I'd urge you to use the newer jiant codebase. It's much better documented, and gets strictly better results than the baselines here. We don't have public dev set numbers from that codebase yet, but if you post an issue there, we should be able to assemble some.

https://github.com/jsalt18-sentence-repl/jiant

If you do need to use this codebase, reply here and @W4ngatang should be able to share the exact hyperparameters we used.

@yutxie
Author

yutxie commented Dec 21, 2018

Thanks for your enthusiastic reply!

I've submitted my predictions to the GLUE platform, but there are still some gaps on CoLA, QNLI, and WNLI.

| | average | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | WNLI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BiLSTM baseline | 63.5 | 24.0 | 85.8 | 82.1/71.9 | 68.8/67.0 | 59.1/80.2 | 65.8/66.0 | 71.1 | 46.8 | 63.7 |
| my results | 60.4 | 13.9 | 84.6 | 81.6/73.0 | 68.8/66.7 | 57.2/79.7 | 61.3/61.8 | 63.0 | 54.2 | 52.7 |

So it would be very kind of you to share the hyperparameters that produce the baselines in this codebase.

Besides, I'm willing to switch to jiant, but I'm not sure whether I can reproduce the GLUE baselines with it. Can I obtain the results on the leaderboard by running the final_glue_runs.sh script without modification?

Thanks again!

@sleepinyourhat
Contributor

@W4ngatang - Could you take this one?

If you need to exactly match our baselines, jiant won't do that. This paper publishes numbers from the final_glue_runs script, though: https://openreview.net/pdf?id=Bkl87h09FX

Sam

@W4ngatang
Collaborator

Hey @xxxxxyt, I've added the exact scripts that I'm running here. Could you try running those?

@Bogerchen

Hey, after fixing lots of issues, I tried running the code. However, I still get the following error:

Traceback (most recent call last):
  File "src/main.py", line 280, in <module>
    sys.exit(main(sys.argv[1:]))
  File "src/main.py", line 186, in main
    trainer = MultiTaskTrainer.from_params(model, args.run_dir + '/%s/' % task.name,
NameError: name 'MultiTaskTrainer' is not defined

I find that 'MultiTaskTrainer' is not defined anywhere in the repository. Could you please share the script that defines 'MultiTaskTrainer'? Many thanks! @thxyutong @sleepinyourhat

@cyente

cyente commented Dec 2, 2019

@Bogerchen hey bro, have you fixed the problem?

@smolPixel

Running into the same MultiTaskTrainer issue. Did someone find a fix? Also, @sleepinyourhat, regarding jiant: I tried using it but found no option for running non-Transformer architectures (I want to rerun the LSTM described in the GLUE paper). Maybe I missed something? I would appreciate you pointing me to the right way to do it :)

@sleepinyourhat
Contributor

The reference to jiant above was to v1.3: https://github.com/nyu-mll/jiant-v1-legacy

The new v2.0 is mostly a wrapper around Transformers, so it drops LSTM support. Start with v1.3.

@sleepinyourhat
Contributor

You'll have a much easier time with jiant than with this repo, but if you need an exact reproduction for some reason, ping w4ngatang again.

@myzwisc

myzwisc commented Oct 21, 2021

Can someone share the MultiTaskTrainer script? I really need this script to exactly reproduce the original GLUE benchmark results. Thanks.
