Here we provide a step-by-step guide for adding a new task to the bigcode-evaluation-harness to evaluate code generation language models. The process is similar to adding tasks in lm-evaluation-harness, which inspired this repository, so this document is based on their task_guide. The Task class is the backbone of all tasks in this framework.
If you haven't already, fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
# After forking...
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
git checkout -b <task-name>
pip install -r requirements.txt
From the bigcode-evaluation-harness project root, copy over the new_task.py template to bigcode_eval/tasks:
cp template/new_task.py bigcode_eval/tasks/<task-name>.py
Open the file you've just created and add a multiline docstring on the first line with the following contents:
"""
<Paper title>
<Paper PDF URL>
<Short description of task>
Homepage: <URL to task's homepage>
"""
All data downloading and management is handled through the Hugging Face (HF) datasets API. If your dataset isn't already on the Hub (see the catalog), please consider adding it to make it accessible to a wider user base by following the new dataset guide.
Now that you have your HF dataset, you need to assign its path and name to your Task in the following fields:
class TaskName(...):
    DATASET_PATH = "..."
    DATASET_NAME = "..."
where DATASET_PATH is the name of the dataset as listed on the HF datasets Hub and DATASET_NAME is the name of the benchmark's sub-task (configuration). If your dataset does not have any configurations/subsets, just set DATASET_NAME = None.
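For example, for a task built on a dataset like HumanEval, which (at the time of writing) is hosted on the Hub as openai_humaneval and has no configuration, this could look as follows; treat the values as illustrative and replace them with your own dataset's path and name:
class TaskName(...):
    # illustrative values -- replace with your dataset's Hub path and configuration
    DATASET_PATH = "openai_humaneval"
    DATASET_NAME = None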
Next, you need to load the evaluation split of the dataset in the get_dataset method. For example:
def get_dataset(self):
    return self.dataset["test"]
You might need to redefine some arguments of the class, like stop_words, which defines the stop words for the stopping criteria during code generation, and requires_execution, which defines whether the task requires code execution or not.
def __init__(self):
    super().__init__(
        stop_words=["\n"],
        requires_execution=True,
    )
Then you need to format your document into a single query prompt, without the answer, to be sent to the language model in the get_prompt method. It takes a single example doc of type dict with str key-value members.
def get_prompt(self, doc):
    return ""
If the prompt involves few-shot examples, you first need to save them in a JSON file <task_name>_few_shot_prompts.json in bigcode_eval/tasks/few_shot_examples and then load them in the fewshot_examples method like this:
def fewshot_examples(self):
    # requires `import json` at the top of the task file
    with open("bigcode_eval/tasks/few_shot_examples/<task_name>_few_shot_prompts.json", "r") as file:
        examples = json.load(file)
    return examples
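You can then prepend these examples to the query in get_prompt. The sketch below assumes a hypothetical file layout (a list of dicts with "prompt" and "solution" keys); adapt it to however you structured your JSON file:
def get_prompt(self, doc):
    # assumes the JSON file holds a list of {"prompt": ..., "solution": ...} dicts;
    # both the layout and the field names are illustrative
    shots = "".join(
        f"{ex['prompt']}\n{ex['solution']}\n\n" for ex in self.fewshot_examples()
    )
    return shots + doc["prompt"]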
The prompt will be sent to the language model, and the generation will be evaluated against ground-truth solutions or unit tests. You need to load them from the doc in the get_target method.
def get_target(self, doc):
    return ""
The solutions generated by the language model often require postprocessing to remove unnecessary text and get executable code. This is done in the postprocess_generation function. It takes as input the model generation generation and the index idx of the document the generation belongs to in the dataset (this is not needed in most cases).
def postprocess_generation(self, generation, idx):
    return ""
The evaluation happens in the process_results function. This function takes as arguments the list of generations for all selected problems of the benchmark in generations and their references in references, and returns a dictionary of metrics and their values.
def process_results(self, generations, references):
    return {}
You need to load your metric and run it. Check the Hugging Face evaluate library for the available metrics. For example, code_eval for pass@k, BLEU for the BLEU score, and apps_metric are already implemented. If you cannot find your desired metric, you can either add it to the evaluate library or implement it in the bigcode_eval/tasks/custom_metrics folder and import it from there.
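For an execution-based task, a minimal sketch of process_results using the code_eval metric from the evaluate library could look like the following; here each reference is assumed to be test code that gets appended to every candidate solution, and the k values are illustrative:
import evaluate  # usually placed at the top of the task file

def process_results(self, generations, references):
    # generations: list (one entry per problem) of lists of candidate solutions
    # references: list of test code strings, one per problem
    code_metric = evaluate.load("code_eval")
    pass_at_k, _ = code_metric.compute(
        references=references,
        predictions=generations,
        k=[1],
    )
    return pass_at_k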
Now is a good time to register your task to expose it for usage. All you'll need to do is import your task module in bigcode_eval/tasks/__init__.py and provide an entry in the TASK_REGISTRY dictionary with the key being the name of your benchmark task (in the form it will be referred to on the command line) and the value being the task class. See how it's done for other tasks in the file.
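For illustration only, with a placeholder module my_task and class TaskName, the registration amounts to adding one import and one entry to the existing dictionary:
# in bigcode_eval/tasks/__init__.py (placeholder names; add to the existing dictionary)
from . import my_task

TASK_REGISTRY = {
    # ... existing entries ...
    "my_task": my_task.TaskName,
}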
To run the entire test suite, use:
pytest
Few-shot tasks are easier to conduct, but if you need to add a fine-tuning script for your task, you can create a folder for it in the finetuning folder and use a training and evaluation script similar to those of the other tasks.
You can format your changes and perform the standard black checks:
black bigcode_eval/tasks/<task-name>.py
Please document your task in the docs with the parameters advised for execution from the literature, like it's done for the other benchmarks.
Please specify in your pull request whether you followed the original paper's approach to build the prompts or whether some changes were introduced (especially if you build few-shot examples). Ideally, you can evaluate some public models, compare the scores to the published results, and see if they match.
If there are no published results for your task, make sure the evaluation works properly by testing some samples with a good code generation model such as InCoder-1B. During the experiments you have the option to save generation.json and references.json; take a look to see if the generations are properly cleaned and are somewhat close to the references, for match-based evaluations for example.
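As a reference point, a small trial run might look like the sketch below; the flag names follow the project's main.py at the time of writing and may change between versions, so check main.py (or the README) for the current options:
accelerate launch main.py \
  --model facebook/incoder-1B \
  --tasks <task-name> \
  --limit 10 \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 5 \
  --batch_size 5 \
  --allow_code_execution \
  --save_generations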
Now push your work and make a pull request! Thanks for the contribution 🚀.