
eval_humaneval

Dataset

datasets/HUMANEVAL/HumanEval.jsonl.gz: the original English version of the HumanEval dataset, containing 164 questions.

datasets/HUMANEVAL/HumanEval-textprompts.jsonl: the Chinese version of the HumanEval dataset, obtained by translating the original prompts with the aid of the GPT-4 model.
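
As a quick sanity check, you can inspect both files from the command line (a minimal sketch, assuming the commands are run from the root of the Yuan-2.0 repository):

$ zcat datasets/HUMANEVAL/HumanEval.jsonl.gz | wc -l          # one task per line, should print 164
$ head -n 1 datasets/HUMANEVAL/HumanEval-textprompts.jsonl    # shows the first translated task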

Evaluation

Introduction

examples/eval_humaneval_2B.sh. The evaluation results for Chinese HumanEval on the 2B model can be obtained by running this script.

examples/eval_humaneval_51B.sh. The evaluation results for Chinese HumanEval on the 51B model can be obtained by running this script.

examples/eval_humaneval_102B.sh. The evaluation results for Chinese HumanEval on the 102B model can be obtained by running this script.

Before running the evaluation script, you only have to specify CHECKPOINT_PATH in the bash script yourself; the other necessary paths are already set up. If you are evaluating a checkpoint you merged yourself, also make sure to change the --tensor-model-parallel-size argument in the bash script to the tensor model parallel size of your newly merged checkpoint:

| Variable name | Description |
| --- | --- |
| CHECKPOINT_PATH | The path to the checkpoint to be evaluated. |
| tensor-model-parallel-size | The tensor model parallel size of the checkpoint being evaluated. |
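
For illustration, the relevant parts of the bash script typically look like the sketch below; the placeholder value and the surrounding layout are assumptions, so check the actual examples/eval_humaneval_*.sh for the exact contents:

# The only path you need to set yourself: the checkpoint to be evaluated
CHECKPOINT_PATH=<Specify the path to the 2B/51B/102B checkpoint>

# Inside the launch command, keep this argument consistent with the tensor
# model parallel size of the (merged) checkpoint; the value 8 is only a placeholder
#   --tensor-model-parallel-size 8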

Requirement

Make sure the HumanEval program is installed before running the HumanEval evaluation on a Yuan2.0 checkpoint.

$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval

After the HumanEval program is installed, open the following file:

/usr/local/lib/python3.10/dist-packages/human_eval-1.0-py3.10.egg/human_eval/execution.py
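
Note that the exact install location of execution.py depends on your Python version and environment, so the path above may differ on your machine. A quick way to locate it (a small sketch, assuming the package was installed successfully) is:

$ python -c "import human_eval.execution; print(human_eval.execution.__file__)"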

Then make the following change to the check_program variable in the check_correctness function, so that there is no duplicate function signature in the generated code (the completions produced by Yuan2.0 already include the function signature from the prompt, so the prompt must not be prepended again):

check_program = (
    #problem["prompt"] +
    completion + "\n" +
    problem["test"] + "\n" +
    f"check({problem['entry_point']})"
)

Also, if you are new to HumanEval, you have to delete the "#" that comments out the line exec(check_program, exec_globals), after reading the warning right above it:

# WARNING
# This program exists to execute untrusted model-generated code. Although
# it is highly unlikely that model-generated code will do something overtly
# malicious in response to this test suite, model-generated code may act
# destructively due to a lack of model capability or alignment.
# Users are strongly encouraged to sandbox this evaluation suite so that it
# does not perform destructive actions on their host or network. For more
# information on how OpenAI sandboxes its code, see the accompanying paper.
# Once you have read this disclaimer and taken appropriate precautions,
# uncomment the following line and proceed at your own risk:
                         exec(check_program, exec_globals)
                result.append("passed")
            except TimeoutException:
                result.append("timed out")
            except BaseException as e:
                result.append(f"failed: {e}")

Usage

Run the following commands to evaluate the 2B, 51B, and 102B models on the HumanEval dataset, respectively. Before running a bash script, change directory to the main directory of 'Yuan-2.0'. You only have to specify the checkpoint path where you store the corresponding checkpoint in the bash script; the other paths, such as the tokenizer and the HumanEval dataset, are already set up in the bash scripts.

Evaluate the 102B model on the HumanEval dataset.

cd <Specify Path>/Yuan-2.0/
bash examples/eval_humaneval_102B.sh

Evaluate the 51B model on the HumanEval dataset.

cd <Specify Path>/Yuan-2.0/
bash examples/eval_humaneval_51B.sh

Evaluate the 2B model on the HumanEval dataset.

cd <Specify Path>/Yuan-2.0/
bash examples/eval_humaneval_2B.sh

Results

The evaluation results will be gathered in samples.jsonl under $OUTPUT_PATH. After generation for all tasks is done, HumanEval's evaluate_functional_correctness entry point automatically evaluates the results and reports the accuracy.
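
If you need to re-score an existing samples.jsonl without regenerating it, the human-eval command line tool can also be invoked directly. The command below is a sketch that assumes the samples were generated from the Chinese prompts; adjust the sample path and --problem_file to match your run:

$ evaluate_functional_correctness $OUTPUT_PATH/samples.jsonl --problem_file=datasets/HUMANEVAL/HumanEval-textprompts.jsonl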