The Visual Document Retrieval Benchmark (ViDoRe) is introduced to evaluate the performance of document retrieval systems on visually rich documents across various tasks, domains, languages, and settings. It was used to evaluate the ColPali model, a VLM-powered retriever that efficiently retrieves documents based on their visual content and textual queries using a late-interaction mechanism.
Tip
If you want to fine-tune ColPali for your specific use case, you should check the colpali repository. It contains the whole codebase used to train the model presented in our paper.
We used Python 3.11.6 and PyTorch 2.2.2 to train and test our models, but the codebase is expected to be compatible with Python >=3.9 and recent PyTorch versions.
The evaluation codebase depends on a few Python packages, which can be installed using the following command:
pip install vidore-benchmark
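Before or after installing, it can help to confirm that the interpreter meets the Python >=3.9 requirement stated above and to see whether the package is already present. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are hypothetical and are not part of vidore-benchmark.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 9)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("vidore-benchmark:", installed_version("vidore-benchmark") or "not installed")
```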
Tip
By default, the vidore-benchmark package already includes the dependencies for the ColVision models (e.g. ColPali, ColQwen2...).
To keep the package lightweight, only the essential dependencies are installed by default. In particular, you must specify the extra dependencies for the specific non-Transformers models you want to run (see the list in pyproject.toml). For instance, if you are going to evaluate the BGE-M3 retriever:
pip install "vidore-benchmark[bge-m3]"
Or if you want to evaluate all the off-the-shelf retrievers:
pip install "vidore-benchmark[all-retrievers]"
Note that in order to use BM25Retriever, you will need to download the nltk resources too:
pip install "vidore-benchmark[bm25]"
python -m nltk.downloader punkt punkt_tab stopwords
The list of available retrievers can be found here. Read this section to learn how to create, use, and evaluate your own retriever.
You can evaluate any off-the-shelf retriever on the ViDoRe benchmark. For instance, you can evaluate the ColPali model on the ViDoRe benchmark to reproduce the results from our paper.
vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.2 \
    --collection-name vidore/vidore-benchmark-667173f98e70a1c0fa4db00d \
    --split test
Alternatively, you can evaluate your model on a single dataset. If your retriever uses visual embeddings, you can use any dataset path from the ViDoRe Benchmark collection, e.g.:
vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.2 \
    --dataset-name vidore/docvqa_test_subsampled \
    --split test
If you want to evaluate a retriever that relies on pure-text retrieval (no visual embeddings), you should use the datasets from the ViDoRe Chunk OCR (baseline) collection instead:
vidore-benchmark evaluate-retriever \
    --model-class bge-m3 \
    --model-name BAAI/bge-m3 \
    --dataset-name vidore/docvqa_test_subsampled_tesseract \
    --split test
All the above scripts will generate a JSON file in outputs/{model_id}_metrics.json. Follow the instructions on the ViDoRe Leaderboard to learn how to publish your results on the leaderboard too!
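After several runs, the outputs directory will hold one metrics file per model. The sketch below gathers them into a single dictionary keyed by model ID, relying only on the `{model_id}_metrics.json` naming pattern mentioned above. The function name `load_metric_files` is hypothetical, and the code makes no assumption about the internal layout of the JSON files.

```python
import json
from pathlib import Path
from typing import Any, Dict

SUFFIX = "_metrics.json"  # naming pattern: outputs/{model_id}_metrics.json


def load_metric_files(outputs_dir: str = "outputs") -> Dict[str, Any]:
    """Map each model_id to the parsed content of its metrics file."""
    results: Dict[str, Any] = {}
    for path in sorted(Path(outputs_dir).glob(f"*{SUFFIX}")):
        # Strip the suffix from the file name to recover the model_id part.
        model_id = path.name[: -len(SUFFIX)]
        with path.open(encoding="utf-8") as f:
            results[model_id] = json.load(f)
    return results


if __name__ == "__main__":
    for model_id in load_metric_files("outputs"):
        print("Found metrics for:", model_id)
```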
To have more control over the evaluation process (e.g. the batch size used at inference), read the CLI documentation using:
vidore-benchmark evaluate-retriever --help
In particular, feel free to play with the --batch-query, --batch-passage, --batch-score, and --num-workers options to speed up the evaluation process.
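If you launch many evaluations from a script, it may be convenient to assemble the command line programmatically. The following is a minimal sketch that only uses the subcommand and flags named in this README. The helper `build_eval_command` is hypothetical, and it assumes the batch and worker options take integer values, which you should confirm with `--help`.

```python
import shlex
from typing import List, Optional


def build_eval_command(
    model_class: str,
    model_name: str,
    dataset_name: str,
    split: str = "test",
    batch_query: Optional[int] = None,
    batch_passage: Optional[int] = None,
    batch_score: Optional[int] = None,
    num_workers: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `vidore-benchmark evaluate-retriever`."""
    cmd = [
        "vidore-benchmark", "evaluate-retriever",
        "--model-class", model_class,
        "--model-name", model_name,
        "--dataset-name", dataset_name,
        "--split", split,
    ]
    # Only append the tuning flags the caller set explicitly.
    optional_flags = {
        "--batch-query": batch_query,
        "--batch-passage": batch_passage,
        "--batch-score": batch_score,
        "--num-workers": num_workers,
    }
    for flag, value in optional_flags.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command(
        "colpali",
        "vidore/colpali-v1.2",
        "vidore/docvqa_test_subsampled",
        batch_query=8,
        batch_passage=8,
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list of arguments avoids shell quoting issues when it is handed to `subprocess.run`.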
While the CLI can be used to evaluate a fixed list of models, you can also use the Python API to evaluate your own retriever. Here is an example of how to evaluate a ColVision model (vidore/colSmol-256M in this example) on the ViDoRe benchmark. Note that your processor must implement the process_images and process_queries methods, similarly to the ColVision processors.
from typing import Dict, Optional
import torch
from colpali_engine.models import ColIdefics3, ColIdefics3Processor
from datasets import load_dataset
from tqdm import tqdm
from vidore_benchmark.evaluation.vidore_evaluators import ViDoReEvaluatorQA
from vidore_benchmark.retrievers import VisionRetriever
from vidore_benchmark.utils.data_utils import get_datasets_from_collection
model_name = "vidore/colSmol-256M"
processor = ColIdefics3Processor.from_pretrained(model_name)
model = ColIdefics3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
# Get retriever instance
vision_retriever = VisionRetriever(
    model=model,
    processor=processor,
)
vidore_evaluator = ViDoReEvaluatorQA(vision_retriever)
# Evaluate on a single dataset
dataset_name = "vidore/tabfquad_test_subsampled"
ds = load_dataset(dataset_name, split="test")
metrics_dataset = vidore_evaluator.evaluate_dataset(
    ds=ds,
    batch_query=4,
    batch_passage=4,
    batch_score=4,
)
# Evaluate on a local directory or a HuggingFace collection
collection_name = "vidore/vidore-benchmark-667173f98e70a1c0fa4db00d" # ViDoRe Benchmark
dataset_names = get_datasets_from_collection(collection_name)
metrics_collection: Dict[str, Dict[str, Optional[float]]] = {}
for dataset_name in tqdm(dataset_names, desc="Evaluating dataset(s)"):
    metrics_collection[dataset_name] = vidore_evaluator.evaluate_dataset(
        ds=load_dataset(dataset_name, split="test"),
        batch_query=4,
        batch_passage=4,
        batch_score=4,
    )
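Once `metrics_collection` is filled, you may want a single number per metric across datasets. The sketch below averages one metric over all datasets and skips entries that are missing or `None`, which the `Optional[float]` annotation above allows. The helper `average_metric` is hypothetical, and the assumption that the per-dataset dictionaries contain an `ndcg_at_5` key should be checked against your own output. The values in the demo are placeholders, not benchmark results.

```python
from statistics import mean
from typing import Dict, Optional


def average_metric(
    metrics_collection: Dict[str, Dict[str, Optional[float]]],
    metric_name: str,
) -> Optional[float]:
    """Average one metric over all datasets, skipping missing or None values."""
    values = [
        metrics[metric_name]
        for metrics in metrics_collection.values()
        if metrics.get(metric_name) is not None
    ]
    # Return None rather than raising when no dataset reports the metric.
    return mean(values) if values else None


if __name__ == "__main__":
    # Placeholder values for illustration only.
    demo = {
        "dataset_a": {"ndcg_at_5": 0.5},
        "dataset_b": {"ndcg_at_5": 1.0},
        "dataset_c": {"ndcg_at_5": None},
    }
    print(average_metric(demo, "ndcg_at_5"))
```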
If you want to evaluate your own retriever with the CLI, you should clone the repository and add your own class that inherits from BaseVisionRetriever. You can find the detailed instructions here.
To easily process, visualize and compare the evaluation metrics of multiple retrievers, you can use the EvalManager class. Assume you have a list of previously generated JSON metric files, e.g.:
data/metrics/
├── bisiglip.json
└── colpali.json
Once loaded, the data is stored in eval_manager.data as a multi-column DataFrame. Use the get_df_for_metric, get_df_for_dataset, and get_df_for_model methods to get the subset of the data you are interested in. For instance:
from vidore_benchmark.evaluation import EvalManager
eval_manager = EvalManager.from_dir("data/metrics/")
df = eval_manager.get_df_for_metric("ndcg_at_5")
ColPali: Efficient Document Retrieval with Vision Language Models
Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
If you want to reproduce the results from the ColPali paper, please read the REPRODUCIBILITY.md file for more information.