Language models encode extensive factual knowledge within their parameters. The accurate assessment of this knowledge is crucial for understanding and improving these models. In the literature, factual knowledge assessment often relies on cloze sentences (e.g., "The capital of France ____"), which can lead to erroneous conclusions due to the complexity of natural language (out-of-subject continuations, the existence of many correct answers, and the many ways of expressing them).
We introduce a new interpretable knowledge assessment method that mitigates these issues by leveraging distractors, i.e., incorrect but plausible alternatives to the correct answer. Our method is evaluated against existing approaches and demonstrates solid alignment with human judgment and stronger robustness to verbalization artifacts.
This repository allows reproducing the results of our paper.
This section describes all the steps required to set up the environment needed to run the experiments described in our paper.
Make sure the following software is installed:
- OS: Ubuntu 22.04 (may also work on Windows, but this has not been tested)
- conda (version used: 24.9.2)
- MongoDB (version used: 7.0.3). Don't forget to start it.
Then, set the following environment variables:
- STORAGE_FOLDER: the path to the folder where intermediate files will be stored.
- MONGO_URL: if MongoDB is not running locally, set this variable to its URL; otherwise, leave it unset.
Plan for 150GB of disk space in the storage folder and another 150GB for MongoDB.
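For example, these variables can be set from the shell as follows (the path and URL below are placeholders to adapt to your setup):
export STORAGE_FOLDER=/path/to/storage
export MONGO_URL="mongodb://remote-host:27017"  # only if MongoDB is not running locally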
Create and activate the conda environment wfd_build:
bash setup_env/wfd_build.sh
conda activate wfd_build
Launch all of the following commands from the root of the project.
- List the Wikidata dumps available for download:
PYTHONPATH=src python list_available_wikidata_dumps.py
- Choose one date (format: YYYYMMDD) and pass it as an argument to this script:
PYTHONPATH=src python download_wikidata.py --date DATE
This will first download the Wikidata dump, push it to MongoDB, and then preprocess it. Expect this step to take a while (>10h).
Reproducibility note: If you want to reproduce the results of our paper, choose the date 20210104.
Run this command:
PYTHONPATH=src python init_distractor_retrievers.py --date DATE
where DATE is the same date used in the previous step.
Congratulations! You are ready to proceed to experiments.
- Sample random facts (noted $S$ in the paper) using the following command:
PYTHONPATH=src python sample_facts.py --type random --date DATE
- Sample facts that have at least one temporal distractor (noted $S'$ in the paper) using the following command:
PYTHONPATH=src python sample_facts.py --type tempdist --date DATE
Reproducibility note: The exact fact samples used in our experiments are provided in the reproducibility folder. To use them, simply copy the two files inside this folder to $STORAGE_FOLDER.
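Assuming $STORAGE_FOLDER is already set and the reproducibility folder contains only these two files, this can be done with:
cp reproducibility/* "$STORAGE_FOLDER/"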
To compare the different distractor retrieval strategies proposed in the paper, launch the following commands:
PYTHONPATH=src python run_experiment.py --experiment compare_retrieval_strategies --date DATE
PYTHONPATH=src python run_experiment.py --experiment compare_retrieval_strategies_dist_temp --date DATE
The result analysis can be generated by executing the Jupyter notebook scripts/general_eval_know_measure/compare_retrieval_strategies_analysis.ipynb.
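For example, assuming Jupyter is installed in the active environment, the notebook can be opened with:
jupyter notebook scripts/general_eval_know_measure/compare_retrieval_strategies_analysis.ipynb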
To choose which results to analyze, set the variable TEMPORAL_DISTRACTOR_RESULT_ANALYSIS at the beginning of the notebook. If it is set to True, the results on the set $S'$ (facts with at least one temporal distractor) are analyzed; otherwise, the results on the set $S$ are analyzed.
The baselines studied alongside our method are: KaRR (our implementation), Probability, BERT-score, ROUGE-L, and LLM-as-a-judge.
First, launch the following command:
PYTHONPATH=src python run_experiment.py --experiment compare_kms --date DATE
Then, analyze the results in the notebook located at scripts/optimize_know_evals/optimize.ipynb, which reports the alignment of each knowledge measure with human judgment and its robustness to verbalization artifacts.
Note: The annotation dataset of verbalization errors is in scripts/taxonomy/taxonomy_hichem.csv, and its statistical analysis is in scripts/taxonomy/analyze.ipynb.
Here is an example of how to measure GPT2's knowledge of the fact "The capital of France is Paris":
import numpy as np
from kb.core import Entity, Relation, Triple, TripleComp
from kb.wikidata import TempWikidata, WikidataPrepStage
from know.core import DistKnowMeasure, StrictMetric
from know.distractor_find import ApproxIdealDistractorFinder
from lm.core import LanguageModel, LogProbability
from verb.core import VerbalizeConfig
from verb.wikifactdiff import WikiFactDiffVerbalizer
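# Load the preprocessed Wikidata snapshot (same date as in the setup steps)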
wd = TempWikidata('20210104', WikidataPrepStage.PREPROCESSED)
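# Load GPT2 and place it on the GPU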
lm = LanguageModel.from_pretrained_name("gpt2", 'cuda')
strict_agg = StrictMetric([20])
dist_finder = ApproxIdealDistractorFinder(wd, lm)
assert dist_finder.built(), "Distractor finder %s must be built before executing this script!" % dist_finder
dist_finder.load()
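# Use log-probability as the credibility function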
cred_func = LogProbability()
know_strict = DistKnowMeasure(strict_agg, dist_finder, cred_func, compute_cred_on_object=True)
# In this example we are testing the knowledge of GPT2 on the fact (France, capital, Paris)
subject, relation, object = Entity('Q142'), Relation('P36'), Entity('Q90')
fact = Triple(subject, relation, object)
# Inject label information in the triple
wd.inject_info(fact)
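# Verbalize the fact as a sentence that ends with the object (cloze-style prompt)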
verbalizer = WikiFactDiffVerbalizer()
config = VerbalizeConfig(
max_num_verbs=1,
verb_tense=None,
ends_with=TripleComp.OBJECT
)
temp = verbalizer.verbalize(fact, config, skip_failed_conjugation=True)
measure = know_strict.measure_temp(lm, temp)
print('The score of GPT2 on the fact %s is: %.2f (value between 0 and 1)' % (fact, measure.result[0]))
Note: You can modify this script to apply our method to other facts and/or other language models.
If you have a problem running our code, please let us know by opening an issue ;)
PAPER INCOMING
See the LICENCE.txt file at the root of this project.