Skip to content

A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.

License

Notifications You must be signed in to change notification settings

THUNLP-MT/StableToolBench

Repository files navigation

favicon StableToolBench

ProjectServerSolvable QueriesInferenceStableToolEvalPaperCitation

Welcome to StableToolBench. Faced with the instability of Tool Learning benchmarks, we developed this new benchmark aiming to balance the stability and reality, based on ToolBench (Qin et al., 2023).

Note for Applying ToolBench keys

Note that if you have applied a ToolBench key but did not get a response for a long time, please contact Shihao Liang ([email protected]) for further assistance.

Updates

  • [2024.09.15] We found there exist some problems in the inference codes of ToolLLaMA v2 and we update model performance accordingly.
  • [2024.06.19] We update the OpenAI API to the newest version, which also support parallel function calling now. We also updated the model performance evaluation using gpt-4-turbo-2024-04-09, replacing gpt-4-turbo-preview, which we found may produce unstable evaluations. The inference results (run in Feb 2024) can be found on Huggingface.

Features

Based on the large scale of ToolBench, we introduce the following features to ensure the stability and reality of the benchmark:

  • Virtual API System, which comprises a caching system and API simulators. The caching system stores API call responses to ensure consistency, while the API simulators, powered by LLMs, are used for unavailable APIs. Note that we keep the large-scale diverse APIs environment from ToolBench.
  • A New Set of Solvable Queries. Query solvability is hard to determine on the fly, causing significant randomness and instability. In StableToolBench, we use state-of-the-art LLMs to determine task solvability to filter queries beforehand. We maintain the same query and answer format as ToolBench for seamless transition from it.
  • Stable Evaluation System: Implements a two-phase evaluation process using GPT-4 as an automatic evaluator. It involves judging the solvability of tasks and employing metrics like Solvable Pass Rate (SoPR) and Solvable Win Rate (SoWR).

The Virtual API Server

Our Virtual API server featured two components, the API simulation system with GPT 4 Turbo and the caching system. We provide two methods to use the virtual API system: building from source and using our prebuilt Docker.

Building from Source

Before you run any code, please first set up the environment by running pip install -r requirements.txt.

To start the server, you need to provide a cache directory and an OpenAI key.

Downloading the cache

We provide a cache to download from HuggingFace or Tsinghua Cloud. After downloading the cache, unzip the folder into the server folder and ensure the server folder contains tool_response_cache folder and tools folder. The resulting folder of server looks like:

├── /server/
│  ├── /tools/
│  │  └── ...
│  ├── /tool_response_cache/
│  │  └── ...
│  ├── config.yml
│  ├── main.py
│  ├── utils.py

Running the server directly

You need to first specify your configurations in server/config.yml before running the server. Parameters needed are:

  • api_key: The API key for OpenAI models.
  • api_base: The API base for OpenAI models if you are using Azure.
  • model: The OpenAI model to use. The default value is gpt-4-turbo-preview.
  • temperature: The temperature for LLM simulation. The default value is 0.
  • toolbench_url: The real ToolBench server URL. The default value is http://8.218.239.54:8080/rapidapi.
  • tools_folder: The tools environment folder path. Default to ./tools.
  • cache_folder: The cache folder path. Default to ./tool_response_cache.
  • is_save: A flag to indicate whether to save real and simulated responses into the cache. The new cache is saved at ./tool_response_new_cache.
  • port: The server port to run on, default to 8080.

Now you can run the server by running:

cd server
python main.py

The server will be run at http://localhost:{port}/virtual. To use the server, you will further need a toolbench key. You can apply one from this form.

Running the server using Docker

We provide a Dockerfile for easy deployment and consistent server environment. This allows you to run the server on various platforms that support Docker.

Prerequisites:

Building the Docker Image:

  1. Navigate to your project directory in the terminal.
  2. Build the Docker image using the following command:
docker build -t my-fastapi-server .  # Replace 'my-fastapi-server' with your desired image name
docker run -p {port}:8080 my-fastapi-server  # Replace 'my-fastapi-server' with your image name

Using the Prebuilt Docker Image

You can also use our prebuilt Docker image from Docker Hub hosted at https://hub.docker.com/repository/docker/zhichengg/stb-docker/general. Before running the docker, you will need to install docker and download the cache files as described in Building from Source. Then you can run the server using the following command:

docker pull zhichengg/stb-docker:latest
docker run -p {port}:8080 -v {tool_response_cache_path}:/app/tool_response_cache -v {tools_path}:/app/tools -e OPENAI_API_KEY= -e OPENAI_API_BASE= zhichengg/stb-docker

Remember to fill in the port, tool_response_cache_path, and tools_path with your own values. The OPENAI_API_KEY and OPENAI_API_BASE are the OpenAI API key and API base if you are using Azure. The server will be run at http://localhost:{port}/virtual.

Testing the Server

You can test the server with

import requests
import json
import os

url = 'http://0.0.0.0:8080/virtual'
data = {
    "category": "Media",
    "tool_name": "newapi_for_media",
    "api_name": "url",
    "tool_input": {'url': 'https://api.socialmedia.com/friend/photos'},
    "strip": "",
    "toolbench_key": ""
}
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}

# Make the POST request
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.text)

Solvable Queries

The original queries are curated without considering the solvability but judging the solvability with ChatGPT on the fly will cause significant instability. Therefore, we judge the solvability of the original queries with the majority vote of gpt-4-turbo, gemini-pro and claude-2. The filtered queries are saved in solvable_queries.

Inference With Our StableToolBench Server

If you have not set up the environment, please first do so by running pip install -r requirements.txt. We currently implement all models and algorithms supported by ToolBench. We show ChatGPT (gpt-3.5-turbo-16k) with CoT as an example here. The script is also shown in inference_chatgpt_pipeline_virtual.sh. An example of the results is shown in data_example/answer.

To use ChatGPT, run:

export TOOLBENCH_KEY=""
export OPENAI_KEY=""
export OPENAI_API_BASE="" 
export PYTHONPATH=./
export GPT_MODEL="gpt-3.5-turbo-16k"
export SERVICE_URL="http://localhost:8080/virtual"
export OUTPUT_DIR="data/answer/virtual_chatgpt_cot"
group=G1_instruction
mkdir -p $OUTPUT_DIR; mkdir -p $OUTPUT_DIR/$group

python toolbench/inference/qa_pipeline_multithread.py \
    --tool_root_dir toolenv/tools \
    --backbone_model chatgpt_function \
    --openai_key $OPENAI_KEY \
    --max_observation_length 1024 \
    --method CoT@1 \
    --input_query_file solvable_queries/test_instruction/${group}.json \
    --output_answer_file $OUTPUT_DIR/$group \
    --toolbench_key $TOOLBENCH_KEY \
    --num_thread 1

StableToolEval

We follow the evaluation process of ToolBench. The difference is that we update the evaluation logic of the Pass Rate and Win Rate, resulting in the Solvable Pass Rate and Solvable Win Rate.

The first step is to prepare data. This step is the same as ToolEval in ToolBench. The following paragraph is adapted from ToolBench. To evaluate your model and method using ToolEval, you first need to prepare all the model predictions for the six test subsets. Create a directory naming with your model and method, e.g. chatgpt_cot then put each test set's predictions under the directory. The file structure of the directory should be:

├── /chatgpt_cot/
│  ├── /G1_instruction/
│  │  ├── /[email protected]
│  │  └── ...
│  ├── /G1_tool/
│  │  ├── /[email protected]
│  │  └── ...
│  ├── ...
│  ├── /G3_instruction/
│  │  ├── /[email protected]
│  │  └── ...

Then preprocess the predictions by running the following commands:

cd toolbench/tooleval
export RAW_ANSWER_PATH=../../data_example/answer
export CONVERTED_ANSWER_PATH=../../data_example/model_predictions_converted
export MODEL_NAME=virtual_chatgpt_cot
export test_set=G1_instruction

mkdir -p ${CONVERTED_ANSWER_PATH}/${MODEL_NAME}
answer_dir=${RAW_ANSWER_PATH}/${MODEL_NAME}/${test_set}
output_file=${CONVERTED_ANSWER_PATH}/${MODEL_NAME}/${test_set}.json

python convert_to_answer_format.py\
    --answer_dir ${answer_dir} \
    --method CoT@1 # DFS_woFilter_w2 for DFS \
    --output ${output_file}

Next, you can calculate the Solvable Pass Rate. Before running the process, you need to specify your evaluation OpenAI key in openai_key.json as follows:

[
    {
        "api_key": "your_openai_key",
        "api_base": "your_organization"
    },
    ...
]

Then calculate SoPR with :

cd  toolbench/tooleval
export API_POOL_FILE=../../openai_key.json
export CONVERTED_ANSWER_PATH=../../data_example/model_predictions_converted
export SAVE_PATH=../../data_example/pass_rate_results
mkdir -p ${SAVE_PATH}
export CANDIDATE_MODEL=virtual_chatgpt_cot
export EVAL_MODEL=gpt-4-turbo-preview
mkdir -p ${SAVE_PATH}/${CANDIDATE_MODEL}

python eval_pass_rate.py \
    --converted_answer_path ${CONVERTED_ANSWER_PATH} \
    --save_path ${SAVE_PATH}/${CANDIDATE_MODEL} \
    --reference_model ${CANDIDATE_MODEL} \
    --test_ids ../../solvable_queries_example/test_query_ids \
    --max_eval_threads 35 \
    --evaluate_times 3 \
    --test_set G1_instruction 

Note that we use gpt-4-turbo-preview as the standard evaluation model, which provided much better stability than gpt-3.5 series models.

The result files will be stored under the ${SAVE_PATH}.

Then you can calculate the SoWR. The below example takes ChatGPT-CoT as the reference model and ChatGPT-DFS as the candidate model. Note that you need to get both model's pass rate results first.

cd  toolbench/tooleval
export API_POOL_FILE=../../openai_key.json
export CONVERTED_ANSWER_PATH=../../data_example/model_predictions_converted
export SAVE_PATH=../../data_example/preference_results
export PASS_RATE_PATH=../../data_example/pass_rate_results
export REFERENCE_MODEL=virtual_chatgpt_cot
export CANDIDATE_MODEL=virtual_chatgpt_dfs
export EVAL_MODEL=gpt-4-turbo-preview
mkdir -p ${SAVE_PATH}


python eval_preference.py \
    --converted_answer_path ${CONVERTED_ANSWER_PATH} \
    --reference_model ${REFERENCE_MODEL} \
    --output_model ${CANDIDATE_MODEL} \
    --test_ids ../../solvable_queries_example/test_query_ids/ \
    --save_path ${SAVE_PATH} \
    --pass_rate_result_path ${PASS_RATE_PATH} \
    --max_eval_threads 10 \
    --use_pass_rate true \
    --evaluate_times 3 \
    --test_set G1_instruction

The result files will be stored under the ${SAVE_PATH}.

Model Experiments Results

Below are the main results (Inference done in Feb 2024). The win rate for each model is compared with ChatGPT-ReACT. We use gpt-4-turbo-2024-04-09 as the evaluator. Evaluation done in May 2024.

Note that the ToolLLaMA v2 performance is update on 15 Sep 2024 with the new inference codes. Legacy performance can be found here

Solvable Pass Rate:

Method I1 Instruction I1 Category I1 Tool I2 Category I2 Instruction I3 Instruction Average
GPT-3.5-Turbo-0613 (CoT) 52.2±1.1 47.3±0.6 53.6±1.3 42.5±2.1 35.8±2.0 48.1±0.8 46.6±1.3
GPT-3.5-Turbo-0613 (DFS) 60.3±1.3 66.2±1.2 67.1±0.0 59.1±0.4 51.3±1.2 73.8±2.3 63.0±1.1
GPT-4-0613 (CoT) 45.5±0.4 57.4±0.3 48.8±0.7 43.0±0.7 46.5±0.9 48.1±1.5 48.2±0.8
GPT-4-0613 (DFS) 57.3±0.6 57.3±0.3 60.9±1.0 57.9±1.0 51.3±0.8 66.4±2.4 58.5±1.0
ToolLLaMA v2 (CoT) 51.8±0.4 53.1±0.6 46.4±1.2 51.6±1.1 48.9±0.4 37.2±0.8 48.2±0.8
ToolLLaMA v2 (DFS) 61.0±1.8 58.8±0.5 45.6±0.9 60.3±1.3 53.5±1.8 48.1±1.5 54.6±1.3
GPT-3.5-Turbo-1106 (CoT) 50.4±0.5 45.1±1.4 50.8±0.3 48.7±0.8 42.1±0.4 55.7±0.0 48.8±0.6
GPT-3.5-Turbo-1106 (DFS) 62.8±0.3 63.9±1.2 65.6±0.3 56.5±0.7 56.9±1.2 67.2±1.3 62.2±0.8
GPT-4-Turbo-Preview (CoT) 52.8±1.3 56.6±0.9 51.9±0.5 51.9±1.0 52.8±0.8 52.5±0.0 53.1±0.8
GPT-4-Turbo-Preview (DFS) 59.2±0.5 61.7±0.7 65.7±1.0 55.6±0.6 55.2±0.4 66.1±4.3 60.6±1.3

In this experiment, we run all models once, evaluate them three times, and take the average results.

Solvable Win Rate: (Reference model: ChatGPT-CoT)

Method I1 Instruction I1 Category I1 Tool I2 Instruction I2 Category I3 Instruction Average
GPT-3.5-Turbo-0613 (DFS) 60.7 67.3 59.5 63.2 62.1 75.4 64.7
GPT-4-0613 (CoT) 54.6 58.8 58.2 75.5 60.5 62.3 61.7
GPT-4-0613 (DFS) 62.6 62.7 58.2 74.5 62.9 67.2 64.7
ToolLLaMA v2 (CoT) 41.7 45.1 32.3 52.8 46.8 26.2 40.8
ToolLLaMA v2 (DFS) 42.3 51.0 31.0 67.0 54.0 31.1 54.0
GPT-3.5-Turbo-1106 (CoT) 47.2 47.7 44.9 50.9 54.0 62.3 51.2
GPT-3.5-Turbo-1106 (DFS) 55.8 53.6 51.9 68.9 59.7 68.9 59.8
GPT-4-Turbo-Preview (CoT) 71.2 77.1 61.4 79.2 71.8 67.2 71.3
GPT-4-Turbo-Preview (DFS) 73.0 75.2 68.4 77.4 66.9 60.7 70.2
We run all models once against GPT-3.5-Turbo-0613 + CoT and evaluate them three times. We follow the ToolBench implementation to take the most frequent result for each query during evaluation.

Acknowledgement

We thank Jingwen Wu and Yao Li for their contributions to experiments and result presentation. We also appreciate Yile Wang and Jitao Xu for their valuable suggestions during discussions.

Citation

@misc{guo2024stabletoolbench,
      title={StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models}, 
      author={Zhicheng Guo and Sijie Cheng and Hao Wang and Shihao Liang and Yujia Qin and Peng Li and Zhiyuan Liu and Maosong Sun and Yang Liu},
      year={2024},
      eprint={2403.07714},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}