This is a benchmark for assessing the ability of models to write visualization code given a description of a Pandas DataFrame.
🛠️ Task. Given a plotting task and a description of a Pandas DataFrame, write the code to build the plot.
The dataset can be found on our HuggingFace page. It is based on the Matplotlib gallery.
The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1.
The `paper_supp_info` directory contains the supplementary materials for our paper, as well as a file with a demonstration of data points (tasks and plots).
📩 If you have any questions or requests concerning this dataset, please contact the author at [email protected].
- Clone the repo: `git clone https://github.com/JetBrains-Research/PandasPlotBench.git`
- Navigate to the directory: `cd PandasPlotBench`
- Run: `poetry install`
- Important: if you are going to run the benchmark on a local machine (this includes using `code_bert_score`), run `poetry install --extras "local_gpu"` instead.
- Edit the config if needed (`configs/config.yaml`).
- Set up environment variables for the proprietary model keys if necessary (see details in the Usage section).
- Run the benchmark (see details in the Usage section): `poetry run python run_benchmark.py`
You can run the benchmark on a subset of the datapoints by passing the `--limit` parameter with either the number of datapoints to run or a list of IDs:

`poetry run python run_benchmark.py --limit=2`
Each datapoint contains a plotting task, a small CSV with the data to be plotted, and the ground truth images. Each task is divided into two parts:
- Plot description. The main part, describing the target plot.
- Plot style description. General guidelines for the styling of the plot.
The tasks can be changed dynamically using the `TaskChanger` class (see the Usage section).
The dataset can be loaded via `load_dataset`:

```python
from datasets import load_dataset

dataset = load_dataset("JetBrains-Research/plot_bench", split="test")
```
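After loading, you can do a quick sanity check of the split; the snippet below only uses the standard `datasets` API, and the exact field names are whatever the dataset schema defines:

```python
from datasets import load_dataset

dataset = load_dataset("JetBrains-Research/plot_bench", split="test")

# Number of datapoints and the schema of the split.
print(len(dataset))
print(dataset.features)

# Field names of the first datapoint (they depend on the dataset schema).
print(dataset[0].keys())
```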
For the code generation models, you can use three options (an illustrative list of model names is sketched after these options):

- vLLM. Just pass the HuggingFace model name in the `model_plot_gen.names` list.
- OpenAI models. Add the "openai/" prefix to the OpenAI model name to select this option. In this case, you should set the `OPENAI_KEY` environment variable with a corresponding token.
- TogetherAI models. Add the "together/" prefix to the TogetherAI model name to select this option. In this case, you should set the `TOGETHERAI_KEY` environment variable with a corresponding token.
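To illustrate the naming conventions, entries in `model_plot_gen.names` might look like the list below. The specific model names are assumptions chosen for illustration, not a verified list from the benchmark config:

```python
# Illustrative entries for model_plot_gen.names (model names are examples only).
model_names = [
    "codellama/CodeLlama-7b-Instruct-hf",       # HuggingFace model served via vLLM
    "openai/gpt-4o-2024-05-13",                 # OpenAI API, requires OPENAI_KEY
    "together/meta-llama/Llama-3-70b-chat-hf",  # TogetherAI API, requires TOGETHERAI_KEY
]
```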
For image-based scoring, we use an OpenAI vision model (the default is `gpt-4o-2024-05-13`). Thus, you have to set the `OPENAI_KEY` environment variable with a corresponding token.

You can provide the keys in a `.env` file at the root of the repository, and they will be loaded automatically.
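Alternatively to the `.env` file, the keys can be set programmatically before constructing the benchmark; this is only a minimal sketch, and the values below are placeholders:

```python
import os

# Placeholder tokens; use your real keys or put them in the .env file instead.
os.environ["OPENAI_KEY"] = "<your-openai-token>"
os.environ["TOGETHERAI_KEY"] = "<your-togetherai-token>"
```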
To run the benchmark from Python:

```python
from plotting_benchmark.benchmark import PlottingBenchmark

benchmark = PlottingBenchmark(config_path="configs/config.yaml")
benchmark.run_benchmark()
```
`run_benchmark` accepts optional parameters (a usage sketch follows this list):

- `ids`: limits the datapoint IDs to be benchmarked, e.g., `ids = [3, 5, 7]`.
- `reuse_results`: if `True`, does not generate plots and reuses the results saved in `results_filename`.
- `load_intermediate`: if `True`, does not generate plots and loads intermediate results from `current_results.jsonl`, which stores intermediate results in case of a crash.
- `only_stats`: if `True`, does not run benchmarking and only calculates the stats from `results_filename`.
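A sketch of how these options might be combined, assuming they are passed as keyword arguments to `run_benchmark` (the IDs are illustrative):

```python
from plotting_benchmark.benchmark import PlottingBenchmark

benchmark = PlottingBenchmark(config_path="configs/config.yaml")

# Benchmark only three specific datapoints.
benchmark.run_benchmark(ids=[3, 5, 7])

# Recompute statistics from previously saved results without regenerating plots.
benchmark.run_benchmark(only_stats=True)
```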
The config template and the LLM instructions can be found in the `plotting_benchmark/resources` directory.
The results are saved in the `out_folder` that is set in the config.
For each benchmarked model, the following files are saved:
- `results_{modelname}_{plottinglib}_{df_descriptor}.json`: a dataset with the results for each datapoint (plots as encoded PNGs, scores, generated code).
- `all_plots_{modelname}_{plottinglib}_{df_descriptor}.ipynb`: a notebook with all the plots of the dataset (code, figures, possible errors).
- `benchmark_stat.jsonl`: statistics for the benchmark scores. The results for each model start on a new line.
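Since `benchmark_stat.jsonl` is a JSON Lines file, it can be loaded for further analysis; the output path below is an assumption and should match the `out_folder` from your config:

```python
import pandas as pd

# "out" is a placeholder; use the out_folder configured in configs/config.yaml.
stats = pd.read_json("out/benchmark_stat.jsonl", lines=True)
print(stats.head())
```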
You can experiment with the wording of tasks, e.g., change the data description or the setup part to control the plotting libraries. To do that, create a `CustomTaskChanger` inheriting from `TaskChanger`. Here is a template for a custom task changer:
```python
import pandas as pd

from plotting_benchmark.benchmark import PlottingBenchmark
from plotting_benchmark.task_changer import TaskChanger


class MyTaskChanger(TaskChanger):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def setup_changer(self, task_text: str, df: pd.DataFrame) -> str:
        return "Use assembler language and [PLOTLIB] library to draw a plot"

    def data_descr_changer(self, task_text: str, df: pd.DataFrame) -> str:
        # generate_custom_dataframe_description is your own helper that builds
        # a custom description of the DataFrame.
        return generate_custom_dataframe_description(task_text, df)

    def plot_descr_changer(self, task_text: str, df: pd.DataFrame) -> str:
        # Be careful with this one - it is the main task, describing the plot.
        return task_text

    def style_changer(self, task_text: str, df: pd.DataFrame) -> str:
        return "Draw a beautiful plot"


benchmark = PlottingBenchmark(
    config_path="configs/config.yaml", task_changer_class=MyTaskChanger
)
benchmark.run_benchmark()
```
- Our approach relies on running LLM-generated code. Besides the safety concerns, there is the issue of having the right libraries installed: the generated code may use libraries that are not installed, which will lead to a failure to build the plot. Perhaps we should list the installed graphical libraries in the prompt.
- `time_used_per_item` in the statistics includes the waiting time in the case of a timeout.