Conversational Papers (cPAPERS): A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers
This repository contains the code and dataset for the paper:
Conversational Papers (cPAPERS): A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers
arXiv preprint: arXiv:2406.08398
The cPAPERS dataset is available on Hugging Face.
Conversational Papers (cPAPERS) is a dataset of conversations in English situated in scientific texts. cPAPERS consists of question-answer pairs pertaining to figures (cPAPERS-FIGS), equations (cPAPERS-EQNS), or tabular information (cPAPERS-TBLS) from scientific papers.
cPAPERS is designed to facilitate research on interactive conversations within the context of scientific papers. It includes question-answer pairs, extracted figures, tables, equations in LaTeX format, and their surrounding context and references in the papers.
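As a rough illustration of the description above, one example could be pictured as a small record. This is a sketch only: the class name, the field names other than `paper_id`, and the `subset_name` helper are hypothetical and are not the released schema. The subset names come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Subset names as given in the dataset description.
SUBSETS = {
    "figure": "cPAPERS-FIGS",
    "equation": "cPAPERS-EQNS",
    "table": "cPAPERS-TBLS",
}


@dataclass
class CPapersExample:
    """Hypothetical record layout; field names are assumptions for illustration."""
    paper_id: str                      # identifier of the source paper
    modality: Literal["figure", "equation", "table"]
    question: str
    answer: str
    content: str                       # LaTeX source, or an image path for figures
    context: str                       # text surrounding the figure/equation/table
    references: List[str] = field(default_factory=list)  # mentions in the paper


def subset_name(example: CPapersExample) -> str:
    """Return the cPAPERS subset an example would belong to."""
    return SUBSETS[example.modality]
```

For the actual column names, check the dataset card on Hugging Face rather than relying on this sketch.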
To collect question-answer pairs and other relevant data, follow these steps:
- Navigate to the data collection folder:
cd data_collection
- Download papers, comments, and reviews from OpenReview:
python collect_data.py
- Extract QA pairs using LLaMA+GPT:
python extract_qas.py
- Clean the QA pairs:
python clean_and_extract_num.py
- Retrieve figures, tables, and equations and their surrounding context from the downloaded .tex files:
python get_figures.py
python get_equations.py
python get_tables.py
- Match QA pairs with the extracted figures, tables, and equations data using paper_id:
python match_data.py
- Convert .pdf and .eps files to .png:
python convert_figures.py
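The steps above can also be chained from a single script. The following is a minimal sketch, not part of the repository: it assumes each script runs without arguments from inside `data_collection`, and the `run_pipeline` function and its `runner` parameter are hypothetical names introduced here.

```python
import subprocess
import sys

# Script order taken from the steps above.
# Assumption: every script runs without command-line arguments.
PIPELINE = [
    "collect_data.py",
    "extract_qas.py",
    "clean_and_extract_num.py",
    "get_figures.py",
    "get_equations.py",
    "get_tables.py",
    "match_data.py",
    "convert_figures.py",
]


def run_pipeline(workdir="data_collection", runner=subprocess.run):
    """Run each script in order inside workdir, stopping at the first failure."""
    for script in PIPELINE:
        print(f"Running {script}")
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on incomplete outputs.
        runner([sys.executable, script], cwd=workdir, check=True)
```

Calling `run_pipeline()` from the repository root would then execute the whole collection pipeline. Stopping on the first error matters here because each step reads the output of the previous one.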
To reproduce the experiments discussed in the paper, use the following commands:
- Run zero-shot experiments for equations:
./run_zs_equation.sh
- Run zero-shot experiments for figures:
./run_zs_figure.sh
- Run zero-shot experiments for tables:
./run_zs_table.sh
- Run fine-tuning experiments for equations:
./run_ft_equation.sh
- Run fine-tuning experiments for figures:
./run_ft_figure.sh
- Run fine-tuning experiments for tables:
./run_ft_table.sh
- Run ablation experiment on the temperature for equations:
./run_zs_equation_temperature.sh
- Run ablation experiment on the temperature for figures:
./run_zs_figure_temperature.sh
- Run ablation experiment on the temperature for tables:
./run_zs_table_temperature.sh
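The nine scripts above follow a regular naming pattern (experiment kind crossed with modality), so a batch run can be sketched as below. This is an illustrative helper, not part of the repository: the function names, the `script_dir` default, and invoking the scripts through `bash` are assumptions.

```python
import subprocess
from pathlib import Path

MODALITIES = ["equation", "figure", "table"]

# Filename patterns follow the script names listed above.
PATTERNS = {
    "zero_shot": "run_zs_{modality}.sh",
    "fine_tune": "run_ft_{modality}.sh",
    "temperature_ablation": "run_zs_{modality}_temperature.sh",
}


def experiment_scripts(kinds=None, modalities=None):
    """List script names for the chosen experiment kinds and modalities."""
    kinds = list(PATTERNS) if kinds is None else kinds
    modalities = MODALITIES if modalities is None else modalities
    return [PATTERNS[k].format(modality=m) for k in kinds for m in modalities]


def run_experiments(scripts, script_dir=".", runner=subprocess.run):
    """Run each script with bash, stopping at the first failure."""
    for name in scripts:
        path = Path(script_dir) / name
        # Assumption: the scripts live in script_dir and take no arguments.
        runner(["bash", str(path)], check=True)
```

For example, `run_experiments(experiment_scripts(kinds=["zero_shot"]))` would run the three zero-shot scripts in turn.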
If you use this dataset in your research, please cite our paper:
@article{sundar2024cpapers,
title={cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers},
author={Sundar, Anirudh and Xu, Jin and Gay, William and Richardson, Christopher and Heck, Larry},
journal={arXiv preprint arXiv:2406.08398},
year={2024}
}