Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [Paper]
This repository hosts the code for our logic scaffolding inferential rule generation framework, covering primitive rule generation and rule composition, together with the data of our generated inferential rule base, ULogic.
You can directly run `primitive_rule_pipeline.ipynb` and `rule_composition.ipynb` for primitive rule generation and rule composition, respectively, in the object interaction domain (for illustration). The function (or step number and name) of each module in our scripts is clearly commented before each section. For rule probing with the different questions in Section 3.2, please run `limit_analysis.ipynb`.
Before running, you need to set your OpenAI API key.
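The snippet below shows one way to do this from Python before launching the notebooks; it is only a sketch, and the exact client setup expected inside each notebook may differ.

```python
import os

# Make the key available to the OpenAI client used inside the notebooks.
# Replace the placeholder with your own key (or export OPENAI_API_KEY in
# your shell instead); the notebooks may also accept the key directly.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, not a real key
```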
`Data/ulogic.json` provides all inferential rules generated by our framework, while `Data/probing_subset.json` provides a high-quality, author-verified subset of ULogic for rule probing of LLMs. Each rule is formatted as follows:
{
  "s_rule": "CanObserve(Person X, Animal Y):- MoveTo(Person X, Region Z), Born(Animal Y, Region Z);",
  "v_rule": "If Person X moves to Region Z and Animal Y is born in Region Z, then Person X can observe Animal Y.",
  "domain": "accessibility",
  "depth": 0,
  "length": 2,
  "positive": true,
  "structure": "disjunctive",
  "label": true,
  "original_human_prediction": "2",
  "flipped_human_prediction": "1"
}
`"label": true` denotes that this instance is expert-verified (by the paper authors). `"original_human_prediction"` and `"flipped_human_prediction"` are the probing results of AMT annotators on the original rules and their flipped counterparts (the definition of flipped rules can be found in the second paragraph of Sec. 3.1). `"original_human_prediction": "2"` and `"flipped_human_prediction": "1"` mean that the original rule is classified as False/Wrong/No by humans, while the flipped rule is classified as True/Right/Yes.
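For reference, the snippet below shows one way to load the probing subset and split a symbolic rule into its head and premises. It is a minimal sketch, not part of the released code: it assumes `Data/probing_subset.json` is a flat list of records with the fields shown above, and the `parse_s_rule` helper is hypothetical.

```python
import json
import re


def parse_s_rule(s_rule: str):
    """Split 'Head :- Premise1, Premise2;' into its head and premise predicates.

    A hypothetical helper for illustration; the released pipeline may
    represent rules differently.
    """
    head, body = s_rule.rstrip(";").split(":-")
    # Split on commas that start a new predicate, not commas inside argument lists.
    premises = re.split(r",\s*(?=[A-Z]\w*\()", body.strip())
    return head.strip(), [p.strip() for p in premises]


# Assumes the file is a flat list of rule records like the example above.
with open("Data/probing_subset.json") as f:
    rules = json.load(f)

verified = [r for r in rules if r.get("label") is True]
print(f"{len(verified)} expert-verified rules out of {len(rules)} total")

head, premises = parse_s_rule(verified[0]["s_rule"])
print(head)      # e.g. CanObserve(Person X, Animal Y)
print(premises)  # e.g. ['MoveTo(Person X, Region Z)', 'Born(Animal Y, Region Z)']
```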
The training and validation data for rule distillation are provided in the `RuleDistillation/Data/general` folder (`training_data_d0.json` is the training data, while `training_data_5000.json` is the validation data).
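As a quick sanity check, the splits can be inspected with standard JSON tooling; the snippet below makes no assumption about the record schema beyond the files being JSON.

```python
import json

# Quick inspection of the rule-distillation splits; this only assumes the
# files are JSON and prints whatever structure they actually contain.
for split in ["RuleDistillation/Data/general/training_data_d0.json",
              "RuleDistillation/Data/general/training_data_5000.json"]:
    with open(split) as f:
        data = json.load(f)
    print(split, type(data), len(data))
    # Peek at the first record to see the field names used for distillation.
    first = data[0] if isinstance(data, list) else next(iter(data.items()))
    print(first)
```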
- To train the Inference Engine, please run the following script:
cd RuleDistillation
bash run_finetune.sh
- During the inference stage, please run the following script (the test data are provided in the `RuleDistillation/Data/human/processed` folder, for conclusion generation, premise completion, and premise generation, respectively):
bash run_generate.sh
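If you want to query the distilled inference engine directly from Python instead of through `run_generate.sh`, a rough sketch with Hugging Face `transformers` is shown below. The checkpoint path, prompt format, and generation settings are assumptions, not the released setup; consult `run_generate.sh` for the configuration used in the paper.

```python
# A rough sketch, not the released inference script: the checkpoint path,
# prompt format, and generation settings below are assumptions; see
# run_generate.sh for the configuration actually used in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "RuleDistillation/output/inference_engine"  # hypothetical path to your finetuned model
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")

prompt = (
    "If Person X moves to Region Z and Animal Y is born in Region Z, "
    "then"  # conclusion generation: ask the model to complete the rule
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```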
We provide a demo of our inference engine. Due to security concerns about external access ports on our server, the hosted demo is no longer accessible. Instead, we have uploaded the Streamlit implementation of our web demo. You can train a model following the rule distillation steps above, and then run
CUDA_VISIBLE_DEVICES=0 streamlit run web_demo.py --server.fileWatcherType none --server.port 8888
to launch the demo on your local server.
This study was authored by Siyuan Wang, Zhongyu Wei, Yejin Choi, and Xiang Ren. We encourage the use of our code and data in your research, and kindly ask that you cite our paper as follows:
@article{wang2024can,
  title={Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs},
  author={Wang, Siyuan and Wei, Zhongyu and Choi, Yejin and Ren, Xiang},
  journal={arXiv preprint arXiv:2402.11442},
  year={2024}
}