This repository accompanies our paper and produces the results reported in it. The research was conducted during AI Safety Camp 2023 by a team consisting of Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, and Jacob Pfau.
Abstract: Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g., question-answering, explanations, etc. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency was random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self-consistency arises as an emergent capability without specifically training for it. Despite this, we find that models are uncalibrated when judging their own consistency, with models displaying both over- and under-confidence. We also propose a nonparametric test for determining from token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.
Best practice is to create a virtual environment and install the relevant dependencies there to develop and run the code:

```shell
conda create -n <env_name> python=3.10
conda activate <env_name>
pip install -r requirements.txt -r requirements_precommit.txt
```
Install pre-commit in your local dev environment:

```shell
pip install pre-commit
pre-commit install
```
This applies various hooks when you commit code. These hooks check the linting and formatting of your code and ensure that it is formatted properly, following flake8 and black standards. You can read more here.
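You can also run all configured hooks manually against the entire repository, which is useful for checking changes before committing:

```shell
# run every configured pre-commit hook on all files, not just staged ones
pre-commit run --all-files
```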
When you get stuck/annoyed by pre-commit rejecting your commit, you may choose to run `git commit -m "your message" --no-verify` (or `-n`) to skip the hooks. This is not recommended because it bypasses the linting and can introduce trouble for other devs.
Tests can be run with `pytest`.

The package layout might lead to errors like `No module named 'src'` when running `pytest` directly. To work around this, invoke pytest as a Python module or update the Python path:

```shell
python -m pytest tests
```
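Alternatively, the Python path can be set explicitly for the invocation (assuming the `src` package sits at the repository root):

```shell
# make the repository root importable so that `src` resolves as a package
PYTHONPATH=. pytest tests
```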
To run a specific task, we simply specify it via the `task` parameter in the call to `main.py`:

```shell
python main.py +task=sequence_completion_equality  # runs exactly that task ("+" before task needed for hydra syntax)
```
Note that a single hydra run is not the same as a single invocation of `python main.py`; using hydra's standard `--multirun` mechanism, we can evaluate several tasks with a single invocation:

```shell
python main.py --multirun +task=sequence_completion_capability,sequence_completion_equality  # '-m' can be used as shorthand for '--multirun'
```
Sweeping over more complex, per-task configurations can be achieved by "packaging" each configuration to be swept over (including both the `task` parameter and the configuration for the chosen task) in a separate yaml file (conventionally under `conf/experiment/`). See `conf/experiment/demo_1.yaml` and `conf/experiment/demo_2.yaml` for extensively commented example files.
Once we have written down our experiment configs, we can run:

```shell
python main.py -m +experiment=demo_1,demo_2
```
The main tasks can be run across the different models as follows:

```shell
python main.py --multirun +task=sequence_completion_capability model=davinci,text-davinci-003,gpt-3.5-turbo-0301,gpt-4-0314
python main.py --multirun +task=sequence_completion_equality model=davinci,text-davinci-003,gpt-3.5-turbo-0301,gpt-4-0314,claude-v1
python main.py --multirun +task=compute_dependence_with_base_changes model=davinci,text-davinci-003,gpt-3.5-turbo-0301,gpt-4-0314
```
The goal of Q1.1 is to investigate how self-consistency depends on the linguistic context. We vary both what precisely we ask for and who (which simulacrum) we ask it of.
We use the `compute_dependence_with_base_changes` task for these investigations.
a. Task / What we are asking for
   1. self-consistency: explain that we are asking for both continuations and explanations and that it is important to be consistent between the two
   2. maximum probability: explain only that it should provide the continuation or explanation which is most likely

b. Role / Who we are asking it of
   1. self: "continue how you, <MODEL_NAME>, would continue"
   2. character: "continue the way a GPT-3 model / skilled human / etc. would continue"
Prompts exploring these different options are constructed as follows:

- We require a task prompt. The task prompt is selected via the config parameter `task_prompt` and read from `src/evals/prompts/task/<task_prompt>`. In each such directory, we have two files: `continuation.txt` (containing the prompt asking for the sequence to be continued) and `explanation.txt` (containing the prompt asking for the sequence to be explained).
- We optionally provide a role prompt which, if provided, is prepended to the task prompt. It is configured via the `role_prompt` config option, with the role prompts being stored in `src/evals/prompts/task/<role_prompt>.txt` (here we only need one file per role since the prefixed role prompt is independent of whether we are asking for a continuation or explanation). An example invocation combining both options is shown below this list.
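For illustration, a sweep combining a task prompt with a role prompt could look as follows. The role prompt name is a placeholder rather than a file known to ship with the repository; use whichever role prompt files are actually present, and note that hydra may require a leading `+` if `role_prompt` is not part of the default config:

```shell
# hypothetical combined sweep; replace <role_prompt_name> with an existing role prompt
python main.py -m +task=compute_dependence_with_base_changes \
    task_prompt=self-consistency,max-probability \
    role_prompt=<role_prompt_name>
```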
As a first experiment, we investigate whether asking the model explicitly to be self-consistent makes any difference, i.e., we compare a1 (`task_prompt: self-consistency`) and a2 (`task_prompt: max-probability`) without further explanations regarding the role the model is supposed to take:

```shell
python main.py -m +task=compute_dependence_with_base_changes task_prompt=self-consistency,max-probability
```
This eval addresses the consideration of alternatives by obtaining log probabilities of different valid and invalid answers to a given ambiguous sequence. We wish to determine whether the model consistently allocates significant probability mass to valid options and what distribution over log probabilities of alternative answers can be observed.

```shell
python main.py -m +task=q2_1_logprob_inequality num_shots=4,6,8,10 seed=41,42,43
```

The related Q2.2 eval on verbalization of alternative answers can be run with:

```shell
python main.py -m +task=q2_2_alternative_verbalization num_shots=4,6,8,10 model=text-davinci-003,gpt-3.5-turbo-0301,gpt-4-0314 seed=41,42,43
```
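Unless the configuration overrides hydra's defaults (`hydra.run.dir` / `hydra.sweep.dir`), outputs and logs end up in hydra's standard output directories; this is an assumption about default hydra behavior rather than something specific to this repository:

```shell
# hydra's default output locations (assuming no override in the repo's config)
ls outputs/   # single runs:  python main.py +task=...
ls multirun/  # sweeps:       python main.py --multirun ...
```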