add malaya tasks #1
base: dev
Conversation
I was not able to get the outputs of the original implementation and the eval-harness implementation to match. lm-evaluation-harness command:
Report output:
llm-benchmarks command:
Reported result
The following changes were made to attempt to reproduce the results: added seed initialization to https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L96 to match the seeds used in https://github.com/aisingapore/lm-evaluation-harness/blob/malaya/lm_eval/evaluator.py#L77, but it appears that the results still did not match.
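A minimal sketch of that seed change, assuming the seed value 1234 that the upstream lm-evaluation-harness evaluator uses:

```python
# Minimal sketch of the seeding added to evaluate.py (assumption: the same
# 1234 seed that lm_eval/evaluator.py sets for its RNGs).
import random

import numpy as np
import torch

random.seed(1234)
np.random.seed(1234)
torch.manual_seed(1234)  # only relevant if torch-level sampling is involved
```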
Commands and outputs
Output:
Output:
Differences: for some reason, the 3-shot run produces different values, suspected to be due to floating point errors: 31.805157593123205 (llm-benchmarks) and 0.31805157593123207 (lm-eval). Aside from the factor of 100 (percentage vs. fraction), the values differ only in the last digit. To fix, the following was changed:
After changing:
To replicate the results:
llm-benchmarks changes
The last change is the fix for the potential floating point issue described above.
lm-eval changes
I managed to get a reproducible port of the Malaya task on the https://github.com/aisingapore/lm-evaluation-harness/tree/temp-malaya branch. I used the following edits to control the random seeds: Change https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L10 to:
Add the following kwargs to https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L112-L121
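A sketch of the kind of kwargs that make generation deterministic, assuming the lines in question wrap a Hugging Face model.generate() call (the model name and values below are placeholders, not the actual diff):

```python
# Sketch only, with placeholder model and values: kwargs that make
# Hugging Face generation deterministic (greedy decoding, no sampling).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Jawab soalan berikut:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,   # illustrative limit
    do_sample=False,     # greedy decoding removes sampling randomness
    num_beams=1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```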
Modify the decode lines at https://github.com/aisingapore/llm-benchmarks/blob/dev/evaluate.py#L123-L124 to "fix" the index errors.
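Continuing the sketch above, one common way such decode-time index issues are handled is to decode only the newly generated tokens; whether this matches the actual edit is an assumption:

```python
# Continuation of the sketch above; assumption, not the actual edit:
# slice off the prompt tokens so only the newly generated text is decoded.
prompt_len = inputs["input_ids"].shape[1]
generated_only = outputs[0][prompt_len:]
print(tokenizer.decode(generated_only, skip_special_tokens=True))
```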
Modify the list contents to force it to run the tasks separately.
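As a sketch of the "run the tasks separately" idea (the evaluate_task helper and the task list are placeholders, not the real llm-benchmarks API):

```python
# Placeholder sketch: call the evaluation entry point once per task
# instead of passing all Malaya tasks in a single list. evaluate_task
# stands in for whatever function evaluate.py actually calls.
def evaluate_task(task_name: str) -> dict:
    """Placeholder for the per-task evaluation call."""
    return {"task": task_name, "score": 0.0}

for task_name in ["tatabahasa"]:  # add the other Malaya task names here
    print(evaluate_task(task_name))
```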
On the lm-eval side, … and avoid reordering the dataset by commenting out https://github.com/aisingapore/lm-evaluation-harness/blob/temp-malaya/lm_eval/utils.py#L837. There are some differences in the filtering, process_docs, and metrics calculation from the implementation in this PR. I didn't have to use the regex hack to get the full output, and I didn't have to use first-N sampling. I was able to get the following outputs:
I notice the same 0.31805157593123207 vs 31.805157593123205 rounding error for tatabahasa 3-shot.
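For what it's worth, the two numbers agree once the reporting scale is normalized; a quick check (not part of either implementation):

```python
# Quick check: the two reported scores differ by a factor of 100
# (percentage vs. fraction) plus a last-digit floating point difference.
import math

llm_benchmarks_score = 31.805157593123205  # reported as a percentage
lm_eval_score = 0.31805157593123207        # reported as a fraction

assert math.isclose(llm_benchmarks_score, lm_eval_score * 100, rel_tol=1e-12)
```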
No description provided.