A previous GH issue (here) mentions that a modified version of this script (here) was used to collect the MMLU numbers. What about the scripts for the other benchmarks in the blog post? As HuggingFace notes here, numbers can vary wildly across evaluation codebases, so it would be useful to know whether HELM, EleutherAI's lm-evaluation-harness, or an internal benchmarking library was used for HellaSwag, Winogrande, etc. The same applies to the long-sequence tasks (e.g. AMI, FD, SCROLLS). Thanks!
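For context on why the codebase matters: one common source of divergence between harnesses is the scoring rule for multiple-choice tasks (e.g. raw summed log-likelihood vs. length-normalized log-likelihood). A toy sketch with made-up numbers, not taken from any real model or harness:

```python
# Toy illustration: the same per-token log-probabilities can yield a different
# multiple-choice prediction depending on the scoring rule a harness applies.
# All numbers are invented for illustration.

def pick(scores):
    """Return the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])

# One made-up question: per-token log-probs for each candidate continuation.
candidates = [
    [-2.0, -1.5],              # short continuation (2 tokens)
    [-1.0, -1.2, -1.1, -0.9],  # long continuation (4 tokens)
]

raw = [sum(lp) for lp in candidates]               # total log-likelihood
normed = [sum(lp) / len(lp) for lp in candidates]  # per-token normalized

print(pick(raw))     # -> 0: raw scoring favors the short candidate
print(pick(normed))  # -> 1: normalization favors the long candidate
```

Aggregated over a benchmark, small per-question flips like this add up, which is why knowing the exact harness (and its settings) is needed to reproduce reported numbers.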