add malaya tasks #1

Open · wheynelau wants to merge 15 commits into dev
Conversation

wheynelau

No description provided.

@liyier90
Collaborator

I was not able to get the outputs of the original implementation and the eval-harness implementation to match.

lm-evaluation-harness command:

lm_eval \
    --model hf \
    --model_args "pretrained=google/gemma-2b,trust_remote_code=True" \
    --tasks malaya_bm_pt3_0shot,malaya_bm_pt3_1shot,malaya_bm_pt3_3shot,malaya_ttbhs_0shot,malaya_ttbhs_1shot,malaya_ttbhs_3shot \
    --device cuda:0 \
    --batch_size 1 \
    --output_path ... \
    --verbosity DEBUG \
    --log_samples

Report output:

hf (pretrained=google/gemma-2b,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|       Tasks       |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|-------|------|-----:|------|-----:|---|-----:|
|malaya_bm_pt3_0shot|Yaml   |regex |     0|acc   |0.1481|±  |0.0488|
|malaya_bm_pt3_1shot|Yaml   |regex |     0|acc   |0.2963|±  |0.0627|
|malaya_bm_pt3_3shot|Yaml   |regex |     0|acc   |0.1296|±  |0.0461|
|malaya_ttbhs_0shot |Yaml   |regex |     0|acc   |0.0229|±  |0.0080|
|malaya_ttbhs_1shot |Yaml   |regex |     0|acc   |0.2464|±  |0.0231|
|malaya_ttbhs_3shot |Yaml   |regex |     0|acc   |0.2951|±  |0.0244|

{
  "malaya_bm_pt3_0shot": {
    "acc,regex": 0.14814814814814814,
    "acc_stderr,regex": 0.04879689797371375,
    "alias": "malaya_bm_pt3_0shot"
  },
  "malaya_bm_pt3_1shot": {
    "acc,regex": 0.2962962962962963,
    "acc_stderr,regex": 0.0627220284341492,
    "alias": "malaya_bm_pt3_1shot"
  },
  "malaya_bm_pt3_3shot": {
    "acc,regex": 0.12962962962962962,
    "acc_stderr,regex": 0.04613879568230498,
    "alias": "malaya_bm_pt3_3shot"
  },
  "malaya_ttbhs_0shot": {
    "acc,regex": 0.022922636103151862,
    "acc_stderr,regex": 0.008022452124849404,
    "alias": "malaya_ttbhs_0shot"
  },
  "malaya_ttbhs_1shot": {
    "acc,regex": 0.24641833810888253,
    "acc_stderr,regex": 0.023100003778706663,
    "alias": "malaya_ttbhs_1shot"
  },
  "malaya_ttbhs_3shot": {
    "acc,regex": 0.29512893982808025,
    "acc_stderr,regex": 0.02444956389052527,
    "alias": "malaya_ttbhs_3shot"
  }
}

llm-benchmarks command:

python evaluate.py \
    --model_path google/gemma-2b \
    --name gemma-2b \
    --output_folder ...

Reported result:

{'tatabahasa': {'n_shot=0': 1.4326647564469914, 'n_shot=1': 25.787965616045845, 'n_shot=3': 27.507163323782237}, 'bmpt3': {'n_shot=0': 18.51851851851852, 'n_shot=1': 25.925925925925924, 'n_shot=3': 24.074074074074073}}

The following changes were made to attempt to reproduce the results:

Added to https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L96

random.seed(0)
np.random.seed(1234)
torch.manual_seed(1234)

to match the seeds used in https://github.com/aisingapore/lm-evaluation-harness/blob/malaya/lm_eval/evaluator.py#L77

However, it appears that simple_evaluate() is only called once for the entire flow, so the random number sequences consumed by the two implementations may not match: https://github.com/aisingapore/lm-evaluation-harness/blob/malaya/lm_eval/__main__.py#L231
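
To illustrate the concern, a standalone toy example (not code from either repository): two consumers of the same seeded RNG diverge as soon as the number or order of draws differs, so matching random.seed(0) alone does not reproduce the few-shot selection.

import random

random.seed(0)
a = [random.sample(range(10), 3) for _ in range(2)]

random.seed(0)
random.random()  # one extra draw somewhere earlier in the flow...
b = [random.sample(range(10), 3) for _ in range(2)]

print(a)
print(b)  # ...and the sampled few-shot indices no longer match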

@wheynelau
Author

Commands and outputs

llm-benchmarks command:

python evaluate.py \
    --model_path "google/gemma-2b" \
    --name "gemma-2b" \
    --output_folder="results" \
    --run_full true

Output:

{'tatabahasa': {'n_shot=0': 2.005730659025788, 'n_shot=1': 30.08595988538682, 'n_shot=3': 31.805157593123205}, 
'bmpt3': {'n_shot=0': 11.11111111111111, 'n_shot=1': 25.925925925925924, 'n_shot=3': 20.37037037037037}}

lm-eval command:

lm_eval --model hf \
    --model_args pretrained=google/gemma-2b,trust_remote_code=true,dtype="float16" \
    --tasks malaya_bm_pt3_0shot,malaya_bm_pt3_1shot,malaya_bm_pt3_3shot,malaya_ttbhs_0shot,malaya_ttbhs_1shot,malaya_ttbhs_3shot \
    --device cuda:1 \
    --batch_size 1 \
    --log_samples \
    --output_path ./lm-eval-out

Output:

|       Tasks       |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|-------|------|-----:|------|-----:|---|-----:|
|malaya_bm_pt3_0shot|Yaml   |regex |     0|acc   |0.1111|±  |0.0432|
|malaya_bm_pt3_1shot|Yaml   |regex |     0|acc   |0.2593|±  |0.0602|
|malaya_bm_pt3_3shot|Yaml   |regex |     0|acc   |0.2037|±  |0.0553|
|malaya_ttbhs_0shot |Yaml   |regex |     0|acc   |0.0201|±  |0.0075|
|malaya_ttbhs_1shot |Yaml   |regex |     0|acc   |0.3009|±  |0.0246|
|malaya_ttbhs_3shot |Yaml   |regex |     0|acc   |0.3181|±  |0.0250|
"results": {
    "malaya_bm_pt3_0shot": {
      "acc,regex": 0.1111111111111111,
      "acc_stderr,regex": 0.04316826054921173,
      "alias": "malaya_bm_pt3_0shot"
    },
    "malaya_bm_pt3_1shot": {
      "acc,regex": 0.25925925925925924,
      "acc_stderr,regex": 0.06019526336088896,
      "alias": "malaya_bm_pt3_1shot"
    },
    "malaya_bm_pt3_3shot": {
      "acc,regex": 0.2037037037037037,
      "acc_stderr,regex": 0.055322127819126765,
      "alias": "malaya_bm_pt3_3shot"
    },
    "malaya_ttbhs_0shot": {
      "acc,regex": 0.02005730659025788,
      "acc_stderr,regex": 0.007515312155132812,
      "alias": "malaya_ttbhs_0shot"
    },
    "malaya_ttbhs_1shot": {
      "acc,regex": 0.3008595988538682,
      "acc_stderr,regex": 0.024585243484996137,
      "alias": "malaya_ttbhs_1shot"
    },
    "malaya_ttbhs_3shot": {
      "acc,regex": 0.31805157593123207,
      "acc_stderr,regex": 0.024965192491672093,
      "alias": "malaya_ttbhs_3shot"
    }
  },
  

Differences:

Apart from the expected factor-of-100 scaling (llm-benchmarks reports percentages while lm-eval reports fractions), only the tatabahasa 3-shot value differs, and only in the last digit; this is suspected to be a floating-point error.

31.805157593123205 (llm-benchmarks) vs 0.31805157593123207 (lm-eval).

To fix

Change https://github.com/aisingapore/llm-benchmarks/blob/3c4c4c38fe2bc9fc83ff54bb355889d11aba2bde/evaluate.py#L94

to

-    return (correct / len(filtered)) / 100
+    return (correct / len(filtered))

After changing:

{'tatabahasa': {'n_shot=0': 0.02005730659025788, 'n_shot=1': 0.3008595988538682, 'n_shot=3': 0.31805157593123207}, 
'bmpt3': {'n_shot=0': 0.1111111111111111, 'n_shot=1': 0.25925925925925924, 'n_shot=3': 0.2037037037037037}}

To replicate

To replicate the results:

  1. do_sample must be set to false in both implementations.
  2. Use first-N few-shot prompting in both (a sketch follows this list).
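
A minimal sketch of the first-N selection, mirroring the diffs below (variable names follow run_test in evaluate.py; the dataset size here is made up):

import random

questions = list(range(7))      # placeholder for the loaded question list
i, n_shots = 3, 2               # current question index and number of shots
arange = set(range(len(questions)))

# Original behaviour: random few-shot picks, which differ between the two
# implementations even with matching seeds
random.seed(0)
random_shots = random.sample(sorted(arange - {i}), n_shots)

# Deterministic "first N" picks used to align both implementations
first_n_shots = sorted(arange - {i})[:n_shots]

print(random_shots)   # depends on the RNG state
print(first_n_shots)  # always [0, 1]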

llm-benchmarks changes

The last change fixes the potential ValueError: max() iterable argument is empty, which occurs when every generation attempt fails and repeat stays empty.

# llm-benchmarks/evaluate.py

@@ -99,7 +99,8 @@ def run_test(args, model, tokenizer, questions, n_shots):
         prompts = []
         if n_shots:
             arange = set(range(len(questions)))
-            shots = random.sample(arange - {i}, n_shots)
+            shots = sorted(arange-{i})[:n_shots]

@@ -115,7 +116,7 @@ def run_test(args, model, tokenizer, questions, n_shots):
                     top_p=0.95,
                     top_k=50,
                     temperature=0.5,
-                    do_sample=True,
+                    do_sample=False,

@@ -125,6 +126,8 @@ def run_test(args, model, tokenizer, questions, n_shots):
         
             except Exception as e:
                 print(e)
+                # necessary if all iterations fail, for most_common
+                repeat.append("")
                 pass
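
A minimal illustration of why the empty append matters, assuming (as the diff comment above suggests) that the repeated answers are later tallied with collections.Counter.most_common:

from collections import Counter

repeat = []                 # every generation attempt raised an exception
# Counter([]).most_common(1) returns [], so any indexing or max() over the
# votes fails; appending "" keeps the vote list non-empty.
repeat.append("")
answer, count = Counter(repeat).most_common(1)[0]
print(repr(answer), count)  # '' 1 -> simply scored as a wrong answer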

lm-eval changes

# lm-evaluation-harness/lm_eval/tasks/malaya/_default.yaml
@@ -5,13 +5,13 @@ doc_to_text: !function utils.doc_to_text
 doc_to_target: jawapan
 output_type: generate_until
 process_results: !function utils.process_results
 repeats: 5
 generation_kwargs:
   max_new_tokens: 3
   top_p: 0.95
   top_k: 50
   temperature: 0.5
-  do_sample: true
+  do_sample: false
   num_beams: 1
   repetition_penalty: 1.05
   max_length: null
   
# lm-evaluation-harness/lm_eval/tasks/malaya/utils.py

@@ -23,9 +23,9 @@ def _process_wrapper(dataset: datasets.Dataset, num_fewshots: int):
         prompts = []
         # curr idx
         # sample N other indices
-        shots = random.sample(sorted(arange - {idx}), num_fewshots)
+        shots = sorted(arange - {idx})[:num_fewshots]
         for no, s in enumerate(shots, start=1):
             prompts.append(
                 f"Contoh soalan {no}\n{doc_to_text(dataset[s])} {doc_to_target(dataset[s])}"

@liyier90
Collaborator

liyier90 commented Jul 1, 2024

I managed to get a reproducible port of the Malaya task on this branch https://github.com/aisingapore/lm-evaluation-harness/tree/temp-malaya

I used the following edits.

To control the random seeds, change https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L10 to

random.seed(0)
np.random.seed(1234)
torch.manual_seed(1234)

Add the following kwargs to https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L112-L121

# temperature=0.5,
# do_sample=True,
do_sample=False,
...
max_length=None,
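
For context, a rough sketch of what the resulting generate call amounts to (illustrative only; the prompt is a placeholder, access to the gated google/gemma-2b weights is assumed, and this is not the exact code in evaluate.py):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("soalan: ...\njawapan:", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=3,
        do_sample=False,        # greedy decoding; top_p/top_k/temperature no longer apply
        num_beams=1,
        repetition_penalty=1.05,
        max_length=None,
    )
print(tokenizer.decode(out[0]))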

Modify the decode lines at https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L123-L124 to "fix" the index errors

r = tokenizer.decode(r[0]).split('jawapan:')[-1].strip()
if r:  # guard against an empty decode, which caused the index errors
    r = r.split()[0]
    r = r.replace('.', '').replace('</s>', '')
    r = r.split('\\')[0].split('/')[0]
repeat.append(r)

Modify the list contents to force it to run the tasks separately.
https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L174-L193
For example

...
for i in [0]:
...

On the lm-evaluation-harness side, set the following task configs https://github.com/aisingapore/lm-evaluation-harness/blob/temp-malaya/lm_eval/tasks/malaya/_default_yaml#L13-L15

  # temperature: 0.5
  # do_sample: true
  do_sample: false

and avoid reordering the dataset by commenting out https://github.com/aisingapore/lm-evaluation-harness/blob/temp-malaya/lm_eval/utils.py#L837

There are some differences in the filtering, process_docs, and metrics calculation compared to the implementation in this PR: I didn't have to use the regex hack to get the full output, and I didn't have to use first-N sampling.

I was able to get the following outputs:

lm-evaluation-harness

  "malaya_tatabahasa_0shot": {
    "acc,multi-choice-extract": 0.02005730659025788,
    "acc_stderr,multi-choice-extract": 0.007515312155132861,
    "alias": "malaya_tatabahasa_0shot"
  }
}
{
  "malaya_tatabahasa_1shot": {
    "acc,multi-choice-extract": 0.2808022922636103,
    "acc_stderr,multi-choice-extract": 0.024089891816072878,
    "alias": "malaya_tatabahasa_1shot"
  }
}
{
  "malaya_tatabahasa_3shot": {
    "acc,multi-choice-extract": 0.31805157593123207,
    "acc_stderr,multi-choice-extract": 0.02496519249167206,
    "alias": "malaya_tatabahasa_3shot"
  }
}
{
  "malaya_bm_pt3_0shot": {
    "acc,multi-choice-extract": 0.12962962962962962,
    "acc_stderr,multi-choice-extract": 0.04613879568230499,
    "alias": "malaya_bm_pt3_0shot"
  }
}
{
  "malaya_bm_pt3_1shot": {
    "acc,multi-choice-extract": 0.25925925925925924,
    "acc_stderr,multi-choice-extract": 0.06019526336088894,
    "alias": "malaya_bm_pt3_1shot"
  }
}
{
  "malaya_bm_pt3_3shot": {
    "acc,multi-choice-extract": 0.2777777777777778,
    "acc_stderr,multi-choice-extract": 0.06152423727810978,
    "alias": "malaya_bm_pt3_3shot"
  }
}

llm-benchmarks

{
  "tatabahasa": {
    "n_shot=0": 2.005730659025788
  },
  "bmpt3": {}
}
{
  "tatabahasa": {
    "n_shot=1": 28.08022922636103
  },
  "bmpt3": {}
}
{
  "tatabahasa": {
    "n_shot=3": 31.805157593123205
  },
  "bmpt3": {}
}
{
  "tatabahasa": {},
  "bmpt3": {
    "n_shot=0": 12.962962962962962
  }
}
{
  "tatabahasa": {},
  "bmpt3": {
    "n_shot=1": 25.925925925925924
  }
}
{
  "tatabahasa": {},
  "bmpt3": {
    "n_shot=3": 27.77777777777778
  }
}

I notice the same last-digit floating-point difference, 0.31805157593123207 vs 31.805157593123205 (aside from the percentage scaling), for tatabahasa 3-shot.
