Orange-OpenSource/DistFactAssessLM

Factual Knowledge Assessment of Language Models Using Distractors


Paper

Language models encode extensive factual knowledge within their parameters. Accurately assessing this knowledge is crucial for understanding and improving these models. In the literature, factual knowledge assessment often relies on cloze sentences (e.g., "The capital of France ____"), which can lead to erroneous conclusions due to the complexity of natural language: off-topic continuations, the existence of many correct answers, and the many ways of expressing them.

We introduce a new interpretable knowledge assessment method that mitigates these issues by leveraging distractors, i.e., incorrect but plausible alternatives to the correct answer. Our method is evaluated against existing approaches, demonstrating solid alignment with human judgment and stronger robustness to verbalization artifacts.

This repository allows you to reproduce the results of our paper.

Setup environment

This section describes all the steps required to set up the environment needed to run the experiments described in our paper.

Software

Make sure the following software is installed:

  • OS: Ubuntu 22.04 -- may also work on Windows (not tested)
  • conda (version used: 24.9.2)
  • MongoDB (version used: 7.0.3) -- don't forget to start it (see the example below)
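
For example, if MongoDB was installed as a systemd service (the standard setup for the official Ubuntu packages), it can be started and checked with:

    sudo systemctl start mongod
    sudo systemctl status mongod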

Environment variables

  • STORAGE_FOLDER: Set this variable to the path of the folder where intermediate files will be stored.
  • MONGO_URL: If MongoDB does not run locally, specify its URL in this variable. Otherwise, leave it unset.

Plan for 150 GB of disk storage in this folder and another 150 GB for MongoDB.
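
For example, in a bash shell (the path and URL below are placeholders to adapt to your setup):

    export STORAGE_FOLDER=/path/to/storage_folder
    export MONGO_URL=mongodb://HOST:27017   # only if MongoDB is not local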

Install virtual environment

Create and activate the conda environment wfd_build

bash setup_env/wfd_build.sh
conda activate wfd_build

Collect data for experiments

Launch all of the following commands from the root of the project.

  1. List the Wikidata dumps available for download:
    PYTHONPATH=src python list_available_wikidata_dumps.py
    
  2. Choose one date (format: YYYYMMDD) and pass it as an argument to this script:
    PYTHONPATH=src python download_wikidata.py --date DATE
    

This command first downloads the Wikidata dump, pushes it to MongoDB, and then preprocesses it. This takes a while (>10h).

Reproducibility note: If you want to reproduce the results of our paper, choose the date 20210104 (see the example below).
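
For example:

    PYTHONPATH=src python download_wikidata.py --date 20210104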

Initialize distractor retrievers

Run this command

PYTHONPATH=src python init_distractor_retrievers.py --date DATE

where DATE is the same date used in the previous step.

Congratulations! You are ready to proceed to experiments.

Experiments

Sampling facts

  1. Sample random facts (noted $S$ in the paper) using the following command:
    PYTHONPATH=src python sample_facts.py --type random --date DATE
  2. Sample facts that have at least one temporal distractor (noted $S'$ in the paper) using the following command:
    PYTHONPATH=src python sample_facts.py --type tempdist --date DATE

Reproducibility note: The exact fact samples used in our experiments are provided in the folder reproducibility. To use them, simply copy the two files inside this folder to $STORAGE_FOLDER (see the example below).
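
For example, assuming STORAGE_FOLDER is set as described above and the reproducibility folder contains only the two sample files:

    cp reproducibility/* "$STORAGE_FOLDER/"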

Compare distractor retrieval strategies

To compare the different distractor retrieval strategies proposed in the paper, launch the following commands:

PYTHONPATH=src python run_experiment.py --experiment compare_retrieval_strategies --date DATE
PYTHONPATH=src python run_experiment.py --experiment compare_retrieval_strategies_dist_temp --date DATE

The result analysis can be generated by executing the Jupyter notebook scripts/general_eval_know_measure/compare_retrieval_strategies_analysis.ipynb.

To choose which results to analyze, set the variable TEMPORAL_DISTRACTOR_RESULT_ANALYSIS at the beginning of the notebook. If it is set to True, the results on the set $S'$ are analyzed; otherwise, the results on $S$ are analyzed (see the paper for more information).
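
For example, to analyze the results on $S'$, set at the beginning of the notebook:

    TEMPORAL_DISTRACTOR_RESULT_ANALYSIS = True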

Compare our knowledge measure with other baselines

The other studied baselines are: KaRR (our implementation), Probability, BERT-score, ROUGE-L, and LLM-as-a-judge.

First, launch the following command:

PYTHONPATH=src python run_experiment.py --experiment compare_kms --date DATE

Then analyze the results in the notebook located at scripts/optimize_know_evals/optimize.ipynb, which reports the alignment of each knowledge measure with human judgment and its robustness to verbalization artifacts.

Note: The annotation dataset of verbalization errors is in scripts/taxonomy/taxonomy_hichem.csv and its statistical analysis is in scripts/taxonomy/analyze.ipynb.

How to measure factual knowledge within language models with our method?

Here is an example of how to measure GPT2's knowledge of the fact "The capital of France is Paris":

import numpy as np
from kb.core import Entity, Relation, Triple, TripleComp
from kb.wikidata import TempWikidata, WikidataPrepStage
from know.core import DistKnowMeasure, StrictMetric
from know.distractor_find import ApproxIdealDistractorFinder
from lm.core import LanguageModel, LogProbability
from verb.core import VerbalizeConfig
from verb.wikifactdiff import WikiFactDiffVerbalizer


wd = TempWikidata('20210104', WikidataPrepStage.PREPROCESSED)
lm = LanguageModel.from_pretrained_name("gpt2", 'cuda')

strict_agg = StrictMetric([20])
dist_finder = ApproxIdealDistractorFinder(wd, lm)

assert dist_finder.built(), "Distractor finder %s must be built first before executing this script!" % dist_finder
dist_finder.load()

cred_func = LogProbability()
know_strict = DistKnowMeasure(strict_agg, dist_finder, cred_func, compute_cred_on_object=True)

# In this example we are testing the knowledge of GPT2 on the fact (France, capital, Paris)
subject, relation, object = Entity('Q142'), Relation('P36'), Entity('Q90')
fact = Triple(subject, relation, object)

# Inject label information into the triple
wd.inject_info(fact)

verbalizer = WikiFactDiffVerbalizer()
config = VerbalizeConfig(
    max_num_verbs=1,
    verb_tense=None,
    ends_with=TripleComp.OBJECT
)

temp = verbalizer.verbalize(fact, config, skip_failed_conjugation=True)
measure = know_strict.measure_temp(lm, temp)

print('The score of GPT2 on the fact %s is: %.2f (value between 0 and 1)' % (fact, measure.result[0]))

Note: Modify this script to use our method on other facts and/or other language models (see the sketch below).
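
For instance, a minimal sketch for the fact (Germany, capital, Berlin): Q183, P36, and Q64 are the standard Wikidata identifiers for Germany, capital, and Berlin, while "gpt2-medium" is only an illustrative Hugging Face model name (whether a given model is supported depends on the repository's LanguageModel wrapper). Replace the corresponding lines in the script above:

    lm = LanguageModel.from_pretrained_name("gpt2-medium", 'cuda')  # illustrative model name
    subject, relation, object = Entity('Q183'), Relation('P36'), Entity('Q64')  # Germany, capital, Berlin
    fact = Triple(subject, relation, object)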

Having an issue with our code?

If you have a problem running our code, please let us know by opening an issue ;)

How to cite our work?

PAPER INCOMING

Licence

Look for the LICENCE.txt file at the root of this project.
