add malaya tasks #1

Open · wheynelau wants to merge 15 commits into dev
Conversation

wheynelau

No description provided.

@liyier90
Collaborator

I was not able to get the outputs of the original implementation and the eval-harness implementation to match.

lm-evaluation-harness command:

lm_eval \
    --model hf \
    --model_args "pretrained=google/gemma-2b,trust_remote_code=True" \
    --tasks malaya_bm_pt3_0shot,malaya_bm_pt3_1shot,malaya_bm_pt3_3shot,malaya_ttbhs_0shot,malaya_ttbhs_1shot,malaya_ttbhs_3shot \
    --device cuda:0 \
    --batch_size 1 \
    --output_path ... \
    --verbosity DEBUG \
    --log_samples

Report output:

hf (pretrained=google/gemma-2b,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|       Tasks       |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|-------|------|-----:|------|-----:|---|-----:|
|malaya_bm_pt3_0shot|Yaml   |regex |     0|acc   |0.1481|±  |0.0488|
|malaya_bm_pt3_1shot|Yaml   |regex |     0|acc   |0.2963|±  |0.0627|
|malaya_bm_pt3_3shot|Yaml   |regex |     0|acc   |0.1296|±  |0.0461|
|malaya_ttbhs_0shot |Yaml   |regex |     0|acc   |0.0229|±  |0.0080|
|malaya_ttbhs_1shot |Yaml   |regex |     0|acc   |0.2464|±  |0.0231|
|malaya_ttbhs_3shot |Yaml   |regex |     0|acc   |0.2951|±  |0.0244|

{
  "malaya_bm_pt3_0shot": {
    "acc,regex": 0.14814814814814814,
    "acc_stderr,regex": 0.04879689797371375,
    "alias": "malaya_bm_pt3_0shot"
  },
  "malaya_bm_pt3_1shot": {
    "acc,regex": 0.2962962962962963,
    "acc_stderr,regex": 0.0627220284341492,
    "alias": "malaya_bm_pt3_1shot"
  },
  "malaya_bm_pt3_3shot": {
    "acc,regex": 0.12962962962962962,
    "acc_stderr,regex": 0.04613879568230498,
    "alias": "malaya_bm_pt3_3shot"
  },
  "malaya_ttbhs_0shot": {
    "acc,regex": 0.022922636103151862,
    "acc_stderr,regex": 0.008022452124849404,
    "alias": "malaya_ttbhs_0shot"
  },
  "malaya_ttbhs_1shot": {
    "acc,regex": 0.24641833810888253,
    "acc_stderr,regex": 0.023100003778706663,
    "alias": "malaya_ttbhs_1shot"
  },
  "malaya_ttbhs_3shot": {
    "acc,regex": 0.29512893982808025,
    "acc_stderr,regex": 0.02444956389052527,
    "alias": "malaya_ttbhs_3shot"
  }
}

llm-benchmarks command:

python evaluate.py \
    --model_path google/gemma-2b \
    --name gemma-2b \
    --output_folder ...

Reported result:

{'tatabahasa': {'n_shot=0': 1.4326647564469914, 'n_shot=1': 25.787965616045845, 'n_shot=3': 27.507163323782237}, 'bmpt3': {'n_shot=0': 18.51851851851852, 'n_shot=1': 25.925925925925924, 'n_shot=3': 24.074074074074073}}

The following changes were made to attempt to reproduce the results:

Added to https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L96

random.seed(0)
np.random.seed(1234)
torch.manual_seed(1234)

to match the seeds used in https://github.com/aisingapore/lm-evaluation-harness/blob/malaya/lm_eval/evaluator.py#L77

However, it appears that simple_evaluate() is only called once for the entire flow, so the random number sequences consumed by the two implementations may not match: https://github.com/aisingapore/lm-evaluation-harness/blob/malaya/lm_eval/__main__.py#L231
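
To illustrate the concern, a standalone toy example (not code from either repository): two consumers of the same seeded RNG diverge as soon as the number or order of draws differs, so matching random.seed(0) alone does not reproduce the few-shot selection.

import random

random.seed(0)
a = [random.sample(range(10), 3) for _ in range(2)]

random.seed(0)
random.random()  # one extra draw somewhere earlier in the flow...
b = [random.sample(range(10), 3) for _ in range(2)]

print(a)
print(b)  # ...and the sampled few-shot indices no longer match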

@wheynelau
Author

Commands and outputs

llm-benchmarks command:

python evaluate.py \
    --model_path "google/gemma-2b" \
    --name "gemma-2b" \
    --output_folder="results" \
    --run_full true

Output:

{'tatabahasa': {'n_shot=0': 2.005730659025788, 'n_shot=1': 30.08595988538682, 'n_shot=3': 31.805157593123205}, 
'bmpt3': {'n_shot=0': 11.11111111111111, 'n_shot=1': 25.925925925925924, 'n_shot=3': 20.37037037037037}}

lm-eval command:

lm_eval --model hf \
    --model_args pretrained=google/gemma-2b,trust_remote_code=true,dtype="float16" \
    --tasks malaya_bm_pt3_0shot,malaya_bm_pt3_1shot,malaya_bm_pt3_3shot,malaya_ttbhs_0shot,malaya_ttbhs_1shot,malaya_ttbhs_3shot \
    --device cuda:1 \
    --batch_size 1 \
    --log_samples \
    --output_path ./lm-eval-out

Output:

|       Tasks       |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|-------|------|-----:|------|-----:|---|-----:|
|malaya_bm_pt3_0shot|Yaml   |regex |     0|acc   |0.1111|±  |0.0432|
|malaya_bm_pt3_1shot|Yaml   |regex |     0|acc   |0.2593|±  |0.0602|
|malaya_bm_pt3_3shot|Yaml   |regex |     0|acc   |0.2037|±  |0.0553|
|malaya_ttbhs_0shot |Yaml   |regex |     0|acc   |0.0201|±  |0.0075|
|malaya_ttbhs_1shot |Yaml   |regex |     0|acc   |0.3009|±  |0.0246|
|malaya_ttbhs_3shot |Yaml   |regex |     0|acc   |0.3181|±  |0.0250|
"results": {
    "malaya_bm_pt3_0shot": {
      "acc,regex": 0.1111111111111111,
      "acc_stderr,regex": 0.04316826054921173,
      "alias": "malaya_bm_pt3_0shot"
    },
    "malaya_bm_pt3_1shot": {
      "acc,regex": 0.25925925925925924,
      "acc_stderr,regex": 0.06019526336088896,
      "alias": "malaya_bm_pt3_1shot"
    },
    "malaya_bm_pt3_3shot": {
      "acc,regex": 0.2037037037037037,
      "acc_stderr,regex": 0.055322127819126765,
      "alias": "malaya_bm_pt3_3shot"
    },
    "malaya_ttbhs_0shot": {
      "acc,regex": 0.02005730659025788,
      "acc_stderr,regex": 0.007515312155132812,
      "alias": "malaya_ttbhs_0shot"
    },
    "malaya_ttbhs_1shot": {
      "acc,regex": 0.3008595988538682,
      "acc_stderr,regex": 0.024585243484996137,
      "alias": "malaya_ttbhs_1shot"
    },
    "malaya_ttbhs_3shot": {
      "acc,regex": 0.31805157593123207,
      "acc_stderr,regex": 0.024965192491672093,
      "alias": "malaya_ttbhs_3shot"
    }
  },
  

Differences:

Apart from the expected factor-of-100 scaling (llm-benchmarks reports percentages while lm-eval reports fractions), only the tatabahasa 3-shot value differs, and only in the last digit; this is suspected to be a floating-point error.

31.805157593123205 (llm-benchmarks) vs 0.31805157593123207 (lm-eval).

To fix

Change https://github.com/aisingapore/llm-benchmarks/blob/3c4c4c38fe2bc9fc83ff54bb355889d11aba2bde/evaluate.py#L94

to

-    return (correct / len(filtered)) / 100
+    return (correct / len(filtered))

After changing:

{'tatabahasa': {'n_shot=0': 0.02005730659025788, 'n_shot=1': 0.3008595988538682, 'n_shot=3': 0.31805157593123207}, 
'bmpt3': {'n_shot=0': 0.1111111111111111, 'n_shot=1': 0.25925925925925924, 'n_shot=3': 0.2037037037037037}}

To replicate

To replicate the results:

  1. do_sample must be set to false in both implementations.
  2. Use first-N few-shot prompting in both (a sketch follows this list).
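
A minimal sketch of the first-N selection, mirroring the diffs below (variable names follow run_test in evaluate.py; the dataset size here is made up):

import random

questions = list(range(7))      # placeholder for the loaded question list
i, n_shots = 3, 2               # current question index and number of shots
arange = set(range(len(questions)))

# Original behaviour: random few-shot picks, which differ between the two
# implementations even with matching seeds
random.seed(0)
random_shots = random.sample(sorted(arange - {i}), n_shots)

# Deterministic "first N" picks used to align both implementations
first_n_shots = sorted(arange - {i})[:n_shots]

print(random_shots)   # depends on the RNG state
print(first_n_shots)  # always [0, 1]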

llm-benchmarks changes

The last change fixes the potential ValueError: max() iterable argument is empty, which occurs when every generation attempt fails and repeat stays empty.

# llm-benchmarks/evaluate.py

@@ -99,7 +99,8 @@ def run_test(args, model, tokenizer, questions, n_shots):
         prompts = []
         if n_shots:
             arange = set(range(len(questions)))
-            shots = random.sample(arange - {i}, n_shots)
+            shots = sorted(arange-{i})[:n_shots]

@@ -115,7 +116,7 @@ def run_test(args, model, tokenizer, questions, n_shots):
                     top_p=0.95,
                     top_k=50,
                     temperature=0.5,
-                    do_sample=True,
+                    do_sample=False,

@@ -125,6 +126,8 @@ def run_test(args, model, tokenizer, questions, n_shots):
         
             except Exception as e:
                 print(e)
+                # necessary if all iterations fail, for most_common
+                repeat.append("")
                 pass
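
A minimal illustration of why the empty append matters, assuming (as the diff comment above suggests) that the repeated answers are later tallied with collections.Counter.most_common:

from collections import Counter

repeat = []                 # every generation attempt raised an exception
# Counter([]).most_common(1) returns [], so any indexing or max() over the
# votes fails; appending "" keeps the vote list non-empty.
repeat.append("")
answer, count = Counter(repeat).most_common(1)[0]
print(repr(answer), count)  # '' 1 -> simply scored as a wrong answer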

lm-eval changes

# lm-evaluation-harness/lm_eval/tasks/malaya/_default.yaml
@@ -5,13 +5,13 @@ doc_to_text: !function utils.doc_to_text
 doc_to_target: jawapan
 output_type: generate_until
 process_results: !function utils.process_results
 repeats: 5
 generation_kwargs:
   max_new_tokens: 3
   top_p: 0.95
   top_k: 50
   temperature: 0.5
-  do_sample: true
+  do_sample: false
   num_beams: 1
   repetition_penalty: 1.05
   max_length: null
   
# lm-evaluation-harness/lm_eval/tasks/malaya/utils.py

@@ -23,9 +23,9 @@ def _process_wrapper(dataset: datasets.Dataset, num_fewshots: int):
         prompts = []
         # curr idx
         # sample N other indices
-        shots = random.sample(sorted(arange - {idx}), num_fewshots)
+        shots = sorted(arange - {idx})[:num_fewshots]
         for no, s in enumerate(shots, start=1):
             prompts.append(
                 f"Contoh soalan {no}\n{doc_to_text(dataset[s])} {doc_to_target(dataset[s])}"

@liyier90
Collaborator

liyier90 commented Jul 1, 2024

I managed to get a reproducible port of the Malaya task on this branch https://github.com/aisingapore/lm-evaluation-harness/tree/temp-malaya

I used the following edits.

To control the random seeds, change https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L10 to

random.seed(0)
np.random.seed(1234)
torch.manual_seed(1234)

Add the following kwargs to https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L112-L121

# temperature=0.5,
# do_sample=True,
do_sample=False,
...
max_length=None,
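
For context, a rough sketch of what the resulting generate call amounts to (illustrative only; the prompt is a placeholder, access to the gated google/gemma-2b weights is assumed, and this is not the exact code in evaluate.py):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("soalan: ...\njawapan:", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=3,
        do_sample=False,        # greedy decoding; top_p/top_k/temperature no longer apply
        num_beams=1,
        repetition_penalty=1.05,
        max_length=None,
    )
print(tokenizer.decode(out[0]))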

Modify the decode lines at https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L123-L124 to "fix" the index errors

r = tokenizer.decode(r[0]).split('jawapan:')[-1].strip()
if r:  # guard against an empty decode, which caused the index errors
    r = r.split()[0]
    r = r.replace('.', '').replace('</s>', '')
    r = r.split('\\')[0].split('/')[0]
repeat.append(r)

Modify the list contents to force it to run the tasks separately.
https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L174-L193
For example

...
for i in [0]:
...

On the lm-evaluation-harness side, set the following task configs https://github.com/aisingapore/lm-evaluation-harness/blob/temp-malaya/lm_eval/tasks/malaya/_default_yaml#L13-L15

  # temperature: 0.5
  # do_sample: true
  do_sample: false

and avoid reordering the dataset by commenting out https://github.com/aisingapore/lm-evaluation-harness/blob/temp-malaya/lm_eval/utils.py#L837

There are some differences in the filtering, process_docs, and metrics calculation compared to the implementation in this PR: I didn't have to use the regex hack to get the full output, and I didn't have to use first-N sampling.

I was able to get the following outputs:

lm-evaluation-harness

  "malaya_tatabahasa_0shot": {
    "acc,multi-choice-extract": 0.02005730659025788,
    "acc_stderr,multi-choice-extract": 0.007515312155132861,
    "alias": "malaya_tatabahasa_0shot"
  }
}
{
  "malaya_tatabahasa_1shot": {
    "acc,multi-choice-extract": 0.2808022922636103,
    "acc_stderr,multi-choice-extract": 0.024089891816072878,
    "alias": "malaya_tatabahasa_1shot"
  }
}
{
  "malaya_tatabahasa_3shot": {
    "acc,multi-choice-extract": 0.31805157593123207,
    "acc_stderr,multi-choice-extract": 0.02496519249167206,
    "alias": "malaya_tatabahasa_3shot"
  }
}
{
  "malaya_bm_pt3_0shot": {
    "acc,multi-choice-extract": 0.12962962962962962,
    "acc_stderr,multi-choice-extract": 0.04613879568230499,
    "alias": "malaya_bm_pt3_0shot"
  }
}
{
  "malaya_bm_pt3_1shot": {
    "acc,multi-choice-extract": 0.25925925925925924,
    "acc_stderr,multi-choice-extract": 0.06019526336088894,
    "alias": "malaya_bm_pt3_1shot"
  }
}
{
  "malaya_bm_pt3_3shot": {
    "acc,multi-choice-extract": 0.2777777777777778,
    "acc_stderr,multi-choice-extract": 0.06152423727810978,
    "alias": "malaya_bm_pt3_3shot"
  }
}

llm-benchmarks

{
  "tatabahasa": {
    "n_shot=0": 2.005730659025788
  },
  "bmpt3": {}
}
{
  "tatabahasa": {
    "n_shot=1": 28.08022922636103
  },
  "bmpt3": {}
}
{
  "tatabahasa": {
    "n_shot=3": 31.805157593123205
  },
  "bmpt3": {}
}
{
  "tatabahasa": {},
  "bmpt3": {
    "n_shot=0": 12.962962962962962
  }
}
{
  "tatabahasa": {},
  "bmpt3": {
    "n_shot=1": 25.925925925925924
  }
}
{
  "tatabahasa": {},
  "bmpt3": {
    "n_shot=3": 27.77777777777778
  }
}

I notice the same last-digit floating-point difference, 0.31805157593123207 vs 31.805157593123205 (aside from the percentage scaling), for tatabahasa 3-shot.
