📃 Paper • 🤗 Models & Datasets
This repository contains the code, models, and data for our paper "FactAlign: Long-form Factuality Alignment of Large Language Models" accepted at EMNLP 2024 Findings. Please cite the following reference if you use the code or models.
```bibtex
@inproceedings{huang2024infactalign,
    title={{FactAlign}: Long-form Factuality Alignment of Large Language Models},
    author={Chao-Wei Huang and Yun-Nung Chen},
    year={2024},
    booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024}
}
```
FactAlign is an alignment framework designed to enhance the factuality of LLMs' long-form responses. It leverages recent advances in automatic factuality assessment to guide the alignment process. Additionally, we introduce fKTO, a fine-grained, sentence-level alignment algorithm that extends the Kahneman-Tversky Optimization (KTO) alignment method.
FactAlign significantly improves the factual accuracy of LLM responses on benchmarks such as LongFact and FactScore.
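For reference, fKTO builds on the KTO objective of Ethayarajh et al. (2024). The summary below is for orientation only (it is not the exact fKTO loss; see `kto_trainer_fg.py` for the implementation). KTO minimizes

$$\mathcal{L}_{\text{KTO}}(\pi_\theta; \pi_{\text{ref}}) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\lambda_y - v(x, y)\big],$$

where $r_\theta(x,y) = \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$, $z_0$ is a reference point estimated as $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$, and

$$v(x, y) = \begin{cases} \lambda_D\,\sigma\big(\beta\,(r_\theta(x,y) - z_0)\big) & \text{if } y \text{ is desirable} \\ \lambda_U\,\sigma\big(\beta\,(z_0 - r_\theta(x,y))\big) & \text{if } y \text{ is undesirable.} \end{cases}$$

fKTO applies this signal at the sentence level: each sentence in the completion carries its own desirable/undesirable (factual/non-factual) label in addition to the response-level label.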
Make a new Python 3.9+ environment using virtualenv or conda.
```bash
conda create -n fact-align python=3.10
conda activate fact-align
```
```bash
# Install python dependencies. Versions are pinned in requirements.txt, but newer versions should generally work.
pip install -r requirements.txt
```
We also use the `alignment-handbook` package for the alignment algorithms. Install it using the following commands:
```bash
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .
```
Note that we used this commit of the `alignment-handbook` package. Newer versions should generally work.
The datasets we generated for training FactAlign, including the long-form responses and the corresponding factuality assessments, are available in our Hugging Face collection.
To generate the datasets, we use an adapted version of the Search-Augmented Factuality Evaluator (SAFE) from Google DeepMind.
Please navigate to the `long-form-factuality` directory and refer to its README for more details on how to generate the datasets.
First, modify the configuration files in the `configs` directory to make sure they fit your local machine.
We used DeepSpeed ZeRO-2 to train the gemma-2b and Phi3-Mini models on 2x V100 32GB GPUs, and DeepSpeed ZeRO-3 to train the LLaMA-3-8B models on 4x A100 40GB GPUs. Please modify the `deepspeed_config_file` path in the `configs/deepspeed_zero*.yaml` files to fit your local machine.
The `configs/kto_*deepspeed.yaml` files are the configurations for training the FactAlign model. You can adjust the hyperparameters in these files.
`kto_trainer_fg.py` contains the implementation of the fKTO trainer. The `FGKTOTrainer` class extends the `KTOTrainer` class from the `trl` package.
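To illustrate the sentence-level idea, below is a minimal, self-contained sketch of a KTO-style loss computed per sentence. It assumes you have already computed per-sentence log-probability ratios between the policy and the reference model; the function name and signature are ours for exposition, and the actual `FGKTOTrainer` (which also handles batching and the KL reference-point estimate inside `trl`'s training loop) is the authoritative implementation.

```python
# Illustrative sketch only -- the reference implementation is FGKTOTrainer in
# kto_trainer_fg.py. The function name and signature here are assumptions.
import torch


def sentence_kto_loss(
    log_ratios: torch.Tensor,  # (num_sentences,) log pi_theta(s|x) - log pi_ref(s|x)
    labels: torch.Tensor,      # (num_sentences,) bool; True = sentence judged factual
    z0: torch.Tensor,          # scalar reference point (KL estimate, as in KTO)
    beta: float = 0.1,
    lambda_d: float = 1.0,     # weight on desirable (factual) sentences
    lambda_u: float = 1.0,     # weight on undesirable (non-factual) sentences
) -> torch.Tensor:
    # KTO value: push factual sentences above the reference point z0 and
    # non-factual sentences below it.
    desirable = torch.sigmoid(beta * (log_ratios - z0))
    undesirable = torch.sigmoid(beta * (z0 - log_ratios))
    values = torch.where(labels, lambda_d * desirable, lambda_u * undesirable)
    weights = labels.float() * lambda_d + (~labels).float() * lambda_u  # lambda_y
    # KTO-style loss: E[lambda_y - v(x, y)], averaged over sentences.
    return (weights - values).mean()
```

In the trainer, a term like this is combined with the standard response-level KTO loss from `trl`'s `KTOTrainer`.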
The dataset for training the FactAlign model with fine-grained factuality assessment should be in the following format:
```json
{
    "prompt": [
        {
            "content": "What is the geographical importance of the Strait of Gibraltar? Provide as many specific details and examples as possible (such as names of people, numbers, events, locations, dates, times, etc.)",
            "role": "user"
        }
    ],
    "completion": [
        {
            "content": "The Strait of Gibraltar is a vital waterway that connects the Atlantic Ocean to the Mediterranean Sea, separating the Iberian Peninsula from the African continent...",
            "role": "assistant"
        }
    ],
    "label": true,
    "completion_sentences": [
        "The Strait of Gibraltar is a vital waterway...",
        "Its geographical importance is multifaceted...",
        "The Strait of Gibraltar is approximately 14 kilometers...",
        "It is situated at the westernmost point..."
    ],
    "sentence_label": [true, true, true, false]
}
```
where `label` is the factuality assessment of the whole completion, `completion_sentences` are the sentences in the completion, and `sentence_label` is the factuality assessment of each sentence.
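As a sanity check, here is a small sketch of building a record in this format with the `datasets` library; `make_record` is a hypothetical helper, not part of this repository.

```python
from datasets import Dataset


def make_record(prompt, completion, label, sentences, sentence_labels):
    # Each sentence must have exactly one factuality label.
    assert len(sentences) == len(sentence_labels)
    return {
        "prompt": [{"role": "user", "content": prompt}],
        "completion": [{"role": "assistant", "content": completion}],
        "label": label,
        "completion_sentences": sentences,
        "sentence_label": sentence_labels,
    }


dataset = Dataset.from_list([
    make_record(
        prompt="What is the geographical importance of the Strait of Gibraltar?",
        completion="The Strait of Gibraltar is a vital waterway. It is 14 km wide.",
        label=True,
        sentences=["The Strait of Gibraltar is a vital waterway.", "It is 14 km wide."],
        sentence_labels=[True, True],
    )
])
```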
You can find the prepared datasets in our Hugging Face collection.
To train the FactAlign model, run the following command:
```bash
bash train_kto.sh
```
The trained model will be saved in the `output_dir` specified in the configuration file.
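Once training finishes, the checkpoint can be loaded like any `transformers` model. A minimal sketch, assuming your own `output_dir` path and a tokenizer with a chat template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/output_dir"  # the output_dir from your training config
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

messages = [{"role": "user", "content": "What is the geographical importance of the Strait of Gibraltar?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```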
We used the LongFact and FactScore benchmarks to evaluate the performance of FactAlign.
FactAlign significantly improves the factual accuracy of LLM responses on these benchmarks.
For LongFact, we used the adapted SAFE evaluation script. Please refer to the README in the `long-form-factuality` directory for more details.
For FactScore, we used a forked version of the official FactScore evaluation script, which supports the up-to-date OpenAI API. Please refer to their repository for more details.
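For orientation, the snippet below mirrors the usage shown in the official FactScore README (the forked version's interface may differ slightly); the topics and generations here are placeholders.

```python
from factscore.factscorer import FactScorer

fs = FactScorer(openai_key="api.key")  # path to a file containing your OpenAI key
topics = ["Strait of Gibraltar"]       # one topic per generation
generations = ["The Strait of Gibraltar is a vital waterway that connects..."]
out = fs.get_score(topics, generations)
print(out["score"])  # FActScore of the generations
```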