This is the code for the thesis "ITRF: Instruction Tuned Retriever Finetuning".
The model weights are available on Hugging Face.
To load the models, use the model/ folder, where both the LLM and the reranker can be loaded. The weights are downloaded automatically from Hugging Face.
To run inference and interact with the pipeline, use the main.py script.
To recreate the corpora for this work:
- Follow the instructions in the Meta ATLAS repo to retrieve their Wikipedia dump and place it under ./data/corpora/.
- Start the vector store Docker container using the script at create_corpus/vector_store/.
- Make sure to create a collection called "retriever", disable indexing before uploading the batches, and re-enable it afterwards. This can be done using the Jupyter notebook found at create_corpus/.
- Use the scripts provided in "create_corpus" to load the Wikipedia dump as well as the CC dump into the vector database.
The shell script starts four workers, each of which loads the dataset, splits it into chunks, embeds the chunks, and uploads them to the vector store in batches of 1,000.
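The chunk-and-batch logic of each worker can be sketched as follows. This is a minimal illustration, not the actual script: the function names, the word-based chunking, and the chunk size are assumptions, and the embedding/upload calls are shown only as hypothetical comments.

```python
def chunk_document(text, chunk_size=200):
    """Split a document into fixed-size word chunks (simplified; the
    actual chunking strategy of the scripts may differ)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def batched(items, batch_size=1000):
    """Yield successive batches of up to batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Each worker would then embed and upload one batch at a time, e.g.:
# for batch in batched(chunks, 1000):
#     vectors = embedder.encode(batch)   # hypothetical embedding call
#     store.upsert(vectors)              # hypothetical upload call
```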
Loading the complete ATLAS Wikipedia dump of about 33 million samples took ~9.5 h using 4x Nvidia RTX 3090.
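As a rough plausibility check of these numbers (a back-of-the-envelope estimate only, assuming the work was split evenly across the four GPUs):

```python
samples = 33_000_000  # approximate size of the ATLAS Wikipedia dump
hours = 9.5           # observed wall-clock time
gpus = 4              # Nvidia RTX 3090 cards used

throughput = samples / hours   # samples per hour overall
per_gpu = throughput / gpus    # samples per hour per GPU

print(f"{throughput:,.0f} samples/h total, {per_gpu:,.0f} samples/h per GPU")
```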
To recreate the dataset, run the script in create_dataset/. Please note that for the notebook to work, the retriever corpus container must already be running.
The dataset can also be retrieved from Hugging Face:

```python
from datasets import load_dataset

dataset = load_dataset("tristanratz/itrf", "llm")
```
Before running the dataset creation for the reranker, make sure you have trained the LLM beforehand.
For training the models, use the Python scripts in train_llm/. Make sure to have the dataset at hand.
To accelerate the training process and to use multiple GPUs, consider configuring Accelerate using:

```shell
accelerate config --config_file <CONFIG_FILE>
```
We optimized and ran our training using DeepSpeed ZeRO and multiple optimization techniques (see train_llm/README for details), in particular to make training fit on 4x RTX 3090.
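An Accelerate configuration for DeepSpeed ZeRO might look like the following. This is an illustrative sketch only; the values (ZeRO stage, accumulation steps, precision) are assumptions, and the settings actually used are described in train_llm/README.

```yaml
# Illustrative accelerate config (values are assumptions, not the repo's)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                      # ZeRO optimizer state partitioning
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu      # offload to fit on 24 GB cards
mixed_precision: bf16
num_processes: 4                     # one process per GPU
```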
To run the training for the 13b-qlora variant of our model use "train_llm/run_training.sh".
Content to follow
The evaluation folder contains the individual steps (retrieval, reranking, generation, and the final Ragas evaluation) as well as the results of our runs.