This is an evaluation set for the problem of directed/targeted test input generation, especially targeting Large Language Models (LLMs). The goal of directed test input (a.k.a. targeted test input) generation is to automatically generate test inputs to reach a certain code location or produce a particular result.
For example:
// EXAMPLE 1:
// generate an input of `s` that can reach the target line:
int arr(char *s){
int symvar = s[0] - 48;
int ary[] ={1,2,3,4,5};
if (ary[symvar%5] == 5){
// target line
}
}
// EXAMPLE 2:
// or produce the expected output
int execute(char* s) {
int ret = system(s);
if(ret == 0){
return OK; // expected
}
return ERR;
}
Targeted input generation plays a crucial role in various software engineering and security tasks, including fuzzing, bug reproduction (where the target is the bug location), and test suite augmentation (where the target is the specific code to be covered). It has also been extensively integrated with conventional testing tools to enhance coverage and overall performance.
Constraint-based techniques, such as symbolic execution and concolic testing, have been well-explored in this problem while Large Language Models (LLMs) have demonstrated exceptionally good performance in code understanding and reasoning. We use PathEval to benchmark and evaluate the ability of LLMs to solve the problem of directed test input generation.
LLMs and constraint-based tools have distinct advantages and disadvantages towards this problem. For instance, it is easy for constrint-based tools to generate the target input for the above EXAMPLE 1 while being challenging for LLMs. Interestingly, it is the opposite for EXAMPLE 2.
Install gcc, g++, python3 and openjdk by the following commands.
apt update
apt install -y -q build-essential gcc g++ python3 python3-pip openjdk-11-jdk-headless libssl-dev
python3 -m pip install tqdm
git clone https://github.com/CGCL-codes/PathEval.git
cd PathEval
git clone https://github.com/CGCL-codes/PathEval.git
cd PathEval
docker build -t patheval .
docker run -it --rm patheval /bin/bash
# cd /work in container
The evaluation will compile and execute untrusted model-generated data and code (see scripts/*_check.py). It is strongly encouraged to run this project in a sandbox (e.g., docker container).
Users can simply use this dataset through a couple of APIs.
Example
from patheval import set_dataset, read_problems, evaluate_one
# select the dataset that corresponds to the language
set_dataset("patheval_cpp") # or "patheval_java" or "patheval_py" or "logic_bombs_c"
# load the problems
problems = read_problems()
# give the completion from your LLMs and the evaluate result will return.
# completion = query(...)
evaluate_one(problems[0], completion)['pass'] # True / False
Or you can validate offline with the script patheval.py
:
python3 patheval.py --dataset patheval_java --input sample.jsonl
For the optional dataset see the documentation for set_dataset
below.
The format of sample.jsonl
is as follows, the order doesn't matter.
{"index": 486, "completion": "\"Mary had a little lamb\", 4"}
{"index": 532, "completion": "\"Hello,Hello,world !\" "}
{"index": 572, "completion": "Arrays.asList(1.0, 2., 3.)"}
...
We expect that each question can have one or more rows of answers (multiple rows of data with the same index). If a question does not have a corresponding answer, then it fails by default.
APIs
set_dataset(dataset_name)
- Purpose: Selects the dataset for evaluation.
- Parameters:
dataset_name
(str): The name of the dataset to use. Valid values are:"patheval_cpp"
: C++ dataset"patheval_java"
: Java dataset"patheval_py"
: Python dataset"logic_bombs_c"
: logic bombs dataset (in C language)
- Returns: None
read_problems()
- Purpose: Loads the problems from the selected dataset.
- Parameters: None
- Returns:
problems
(list): A list of problems.
evaluate_one(problem, completion)
- Purpose: Evaluate the given completion for a single problem.
- Parameters:
problem
(object): One problem from theread_problems()
returned list.completion
(str): The completion generated by LLMs.
- Returns: the input
problem
dictionary, with an additional key:"pass"
(bool): Indicates whether the completion passes the problem
For each sample, the following information is provided:
Key | Description |
---|---|
index | index in this dataset |
humaneval_task_id | task id in HumanEval |
focal_method_name | function name of focal method |
focal_method_para | parameters of focal method |
focal_method_return_type | return value type of focal method |
focal_method | code of focal method |
target | code of target |
In logic_bombs samples, humaneval_task_id is replaced by logic_bombs_task_id.
The samples are placed under the data
folder in .jsonl
files, the samples in the three different files are semantically equivalent, but are implemented in different programming languages.
An example C++ sample is shown as follow.
{
"index": 180,
"humaneval_task_id": "CPP/53",
"focal_method_name": "add",
"focal_method_para": "(int x,int y)",
"focal_method_return_type": "int",
"focal_method": "#include<stdio.h>\n#include<stdlib.h>\nusing namespace std;\n#include<algorithm>\n#include<math.h>\nint add(int x,int y){\n return x+y;\n}",
"target": "#undef NDEBUG\n#include<assert.h>\n\nint main(){\n\tauto result = add(<FILL_ME>);\n\tassert(result==5);\n}"
}
We use <FILL_ME>
to mark the position for LLMs to fill in.
We are extending the dataset from real-world open-source projects, a sample is shown under data/rw
. These samples will be added to this repo soon.
There exists subtle difference in the number of samples in the dataset for the three programming languages due to the fact that this dataset was converted from HumanEval-X in a automated manner, where a very small number of samples failed in the automated process and were thus discarded.
This is originally created for our paper "Towards Understanding the Effectiveness of Large Language Models on Directed Test Input Generation" (ASE 2024). The preview version is here.
@inproceedings{jiang2024towards,
title={Towards Understanding the Effectiveness of Large Language Models on Directed Test Input Generation},
author={Zongze, Jiang and Ming, Wen and Jialun, Cao and Xuanhua, Shi and Hai, Jin },
booktitle={39th {IEEE/ACM} International Conference on Automated Software Engineering,
{ASE} 2024, California, United States, October 27 - November 1, 2024},
year={2024}
}