File | Description |
---|---|
datasets/HUMANEVAL/HumanEval.jsonl.gz | The original English version of the HumanEval dataset, containing 164 questions. |
datasets/HUMANEVAL/HumanEval-textprompts.jsonl | The Chinese version of the HumanEval dataset, obtained by translation with the aid of the GPT-4 model. |
examples/eval_humaneval_2B.sh | Script for obtaining the Chinese HumanEval evaluation results of the 2B model. |
examples/eval_humaneval_51B.sh | Script for obtaining the Chinese HumanEval evaluation results of the 51B model. |
examples/eval_humaneval_102B.sh | Script for obtaining the Chinese HumanEval evaluation results of the 102B model. |
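Both dataset files are in JSON Lines format, with one problem per line. If you want to take a quick look at the data before running the evaluation, the following commands print the first problem of each file (a minimal sketch; run it from the Yuan-2.0 root directory and adjust the paths otherwise):

# Print the first problem of each dataset file.
zcat datasets/HUMANEVAL/HumanEval.jsonl.gz | head -n 1
head -n 1 datasets/HUMANEVAL/HumanEval-textprompts.jsonl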
Before running an evaluation script, you only need to set the following CHECKPOINT_PATH variable in the bash script yourself; the other required paths are already set up. If you are evaluating a checkpoint that you merged yourself, also make sure to change the following --tensor-model-parallel-size argument in the bash script to the tensor model parallel size of your newly merged checkpoint (see the example excerpt after the table below):
Variable name | Description |
---|---|
CHECKPOINT_PATH | The path of the checkpoint to be evaluated. |
tensor-model-parallel-size | The tensor model parallel size of the checkpoint being evaluated. |
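For reference, the relevant part of a bash script looks roughly like the excerpt below (a hypothetical sketch; the exact variable names, values, and surrounding arguments in examples/eval_humaneval_*.sh may differ):

# Hypothetical excerpt of an eval_humaneval_*.sh script; normally only these
# two settings need to be touched.
CHECKPOINT_PATH=/path/to/your/merged/checkpoint
# ... later, inside the launch command:
#     --tensor-model-parallel-size 8 \    # must match the tensor parallel size of the checkpoint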
Make sure the HumanEval package is installed before running the HumanEval evaluation on a Yuan2.0 checkpoint:
$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
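To confirm that the package was installed correctly, you can load the bundled problem set; "read_problems" comes from the upstream human-eval package and should report the 164 problems mentioned above:

$ python -c "from human_eval.data import read_problems; print(len(read_problems()))"   # should print 164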
After the HumanEval package is installed, open the following file (the exact path may differ depending on your Python installation),
/usr/local/lib/python3.10/dist-packages/human_eval-1.0-py3.10.egg/human_eval/execution.py
and make the following change to the "check_program" variable in the "check_correctness" function, to ensure there is no duplicate function signature in the generated code:
check_program = (
#problem["prompt"] +
completion + "\n" +
problem["test"] + "\n" +
f"check({problem['entry_point']})"
)
Also, if this is a fresh installation of HumanEval, you have to uncomment the line "exec(check_program, exec_globals)" by deleting the "#" right before it; the line is disabled by default, as explained in the warning shown below:
# WARNING
# This program exists to execute untrusted model-generated code. Although
# it is highly unlikely that model-generated code will do something overtly
# malicious in response to this test suite, model-generated code may act
# destructively due to a lack of model capability or alignment.
# Users are strongly encouraged to sandbox this evaluation suite so that it
# does not perform destructive actions on their host or network. For more
# information on how OpenAI sandboxes its code, see the accompanying paper.
# Once you have read this disclaimer and taken appropriate precautions,
# uncomment the following line and proceed at your own risk:
exec(check_program, exec_globals)
result.append("passed")
except TimeoutException:
result.append("timed out")
except BaseException as e:
result.append(f"failed: {e}")
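As the warning above (part of the upstream human-eval code) explains, this exec call runs untrusted model-generated code directly. If you want an extra layer of isolation, one option is to run the whole evaluation inside a container with networking disabled; the following is only a sketch, not part of the Yuan-2.0 scripts, and assumes a suitable image containing the Yuan-2.0 dependencies:

# Sketch only: run the evaluation in an isolated container (the image name is a placeholder).
docker run --rm --gpus all --network none \
    -v "$(pwd)":/workspace -w /workspace <your-yuan-2.0-image> \
    bash examples/eval_humaneval_2B.sh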
Run the following commands to evaluate the performance of the 2B, 51B, and 102B models on the HumanEval dataset, respectively. Before running a bash script, change to the main directory of 'Yuan-2.0'. You only have to specify the checkpoint path of the model being evaluated in the bash script; the other paths, such as the tokenizer and the HumanEval dataset, are already set up in the bash scripts.
Evaluate the 102B model on the HumanEval dataset.
cd <Specify Path>/Yuan-2.0/
bash examples/eval_humaneval_102B.sh
Evaluate the 51B model on the HumanEval dataset.
cd <Specify Path>/Yuan-2.0/
bash examples/eval_humaneval_51B.sh
Evaluate the 2B model on the HumanEval dataset.
cd <Specify Path>/Yuan-2.0/
bash examples/eval_humaneval_2B.sh
The generated results will be gathered in samples.jsonl under $OUTPUT_PATH. After generation for all tasks is done, the "evaluate_functional_correctness" function of HumanEval will automatically evaluate the results and return the accuracy.
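The Yuan-2.0 scripts invoke this scoring step for you, but if you want to re-score an existing samples.jsonl without rerunning generation, the same tool can be called manually from the command line (standard human-eval usage; replace $OUTPUT_PATH with your actual output directory):

$ evaluate_functional_correctness $OUTPUT_PATH/samples.jsonl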