GNN extension and improvements on CoqGym repository.
Note, this repo includes the codebase of Coq, SerAPI, CoqHammer, and the Coq projects in coq_projects.
- Notation
- Project Goals
- Main Contributions
- Setup and Installation
- Running train and test pipelines
- FAQ and Known Bugs
- Resources
Here is some important notation for understanding the explanations below:
- $G = (V, E)$: Graph notation, where $|V|$ is the number of vertices (nodes) and $|E|$ is the number of edges.
- `x`: $\mathbb{R}^{|V|}$ node list of node types, referenced by index into the non-terminal node information. Child-first ordering enforced by `traverse_postorder`.
- `edge_index`: $\mathbb{R}^2 \times \mathbb{R}^{|E|}$ edge list referenced by index into `x`.
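To make the notation concrete, here is a minimal sketch of how `x` and `edge_index` could be derived from a tree with child-first numbering. The `Node` class, the `vocab` mapping, and the child-to-parent edge direction are assumptions made for illustration; this is not the repository's `traverse_postorder` implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a non-terminal lark.tree.Tree node."""
    label: str
    children: List["Node"] = field(default_factory=list)


def tree_to_graph(root: Node, vocab: Dict[str, int]) -> Tuple[List[int], List[List[int]]]:
    """Return (x, edge_index) with nodes numbered child-first (postorder)."""
    x: List[int] = []
    sources: List[int] = []
    targets: List[int] = []

    def visit(node: Node) -> int:
        # Children are numbered before their parent.
        child_ids = [visit(child) for child in node.children]
        node_id = len(x)
        x.append(vocab[node.label])
        for child_id in child_ids:
            # Assumption: edges point from child to parent.
            sources.append(child_id)
            targets.append(node_id)
        return node_id

    visit(root)
    return x, [sources, targets]


vocab = {"app": 0, "const": 1, "var": 2}  # hypothetical node-type vocabulary
tree = Node("app", [Node("const"), Node("var")])
x, edge_index = tree_to_graph(tree, vocab)
print(x)           # [1, 2, 0]
print(edge_index)  # [[0, 1], [2, 2]]
```

The recursive traversal is fine for small trees; very deep ASTs would likely need an iterative version to avoid hitting Python's recursion limit.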
Building on recent progress in graph neural networks (GNNs), we hope to improve on the original CoqGym results by replacing its TreeLSTM encoder module with various GNN implementations.
The main contributions of this repository are some pipeline modifications to allow for the GNNs, and the actual GNN implementation. In the following sections, these modifications will be presented in detail with comparisons to the original when relevant.
These implementations are split between many branches to provide easier management of different tests:
- `master`: Contains the original CoqGym code at the time of the fork.
- `bofb`: Contains the first implementation of graph batches.
- `rl-mods`: Modified decoder with a more expressive attention mechanism.
- `int-emb`: Contains the implementation of `IntegerFeatureEncoder` in the encoder and the modifications from `rl-mods`.
All of the above branches (except for `master`) have implementations of the GNN.
In order to learn using GNNs efficiently, `x` and `edge_index` information need to be extracted from the `lark.tree.Tree` representations of ASTs. This computation is very costly, so it is delegated to the proof extraction stage. Here are the pipeline modifications that facilitate this change:
- Modified proof step data representation from `dict` to `torch_geometric.data.Batch` objects
- Modified merge operation to facilitate `Batch` merging
- Used `torch_geometric.data.Dataset` object over `torch.utils.data.Dataset` object
- Changed saving protocol from `.pickle` to PyTorch-optimized `.pt` using `torch.save()`
More explicitly, a comparison of the proof step structures is outlined below:
CoqGym proof step:
{
    file : str,
    proof_name : str,
    n_step : int,
    env : [
        {
            qualid : str,
            ast : lark.tree.Tree
        },
        ...
    ],
    local_context : [
        {
            ident : str,
            ast : lark.tree.Tree
        },
        ...
    ],
    goal : lark.tree.Tree,
    tactic_actions : list[int | str],
    tactic_str : str,
}
CoqGym-GNN proof step:
torch_geometric.data.Batch (
    x : torch.Tensor,
    edge_index : torch.Tensor,
    batch : torch.Tensor,
    # Some modifications to original CoqGym attributes
    file : str,
    proof_name : str,
    n_step : int,
    env : [
        {
            qualid : str,
            ast : lark.tree.Tree
        },
        ...
    ],
    local_context : [
        {
            ident : str,
            text : str,
            ast : lark.tree.Tree
        },
        ...
    ],
    goal : {
        id : int,
        text : str,
        ast : lark.tree.Tree
    },
    tactic_actions : int | str,
    tactic_str : str,
)
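The `x`, `edge_index`, and `batch` fields follow the usual graph-batching idea: node lists are concatenated, edge indices are shifted by the number of nodes that came before, and `batch` records which graph each node belongs to. The sketch below shows that idea with plain Python lists; `merge_graphs` and the dictionary layout are hypothetical and stand in for what `torch_geometric.data.Batch` does with tensors.

```python
from typing import Dict, List

Graph = Dict[str, list]  # {"x": [...], "edge_index": [[src, ...], [dst, ...]]}


def merge_graphs(graphs: List[Graph]) -> Graph:
    """Merge several small graphs into one disconnected graph."""
    x: List[int] = []
    src: List[int] = []
    dst: List[int] = []
    batch: List[int] = []
    for graph_id, graph in enumerate(graphs):
        offset = len(x)  # number of nodes already placed in the merged graph
        x.extend(graph["x"])
        src.extend(s + offset for s in graph["edge_index"][0])
        dst.extend(d + offset for d in graph["edge_index"][1])
        batch.extend([graph_id] * len(graph["x"]))
    return {"x": x, "edge_index": [src, dst], "batch": batch}


goal = {"x": [1, 2, 0], "edge_index": [[0, 1], [2, 2]]}
hypothesis = {"x": [2, 0], "edge_index": [[0], [1]]}
merged = merge_graphs([goal, hypothesis])
print(merged["x"])           # [1, 2, 0, 2, 0]
print(merged["edge_index"])  # [[0, 1, 3], [2, 2, 4]]
print(merged["batch"])       # [0, 0, 0, 1, 1]
```

Because the merged graph has no edges between its components, a GNN layer applied to it processes every AST independently in a single pass.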
Along with these changes, optimizations were made to the data generation process that make the extracted dataset easier to update and reduce computational resource requirements. Specifically:
- Added a `filter_file` option in `iter_proofs` and its derivatives to skip loading proofs that are not being considered.
- Save files as data is generated, so a monolithic list is not needed to keep track of data.
- Added multiprocessing scripts for both `extract_proof_steps.py` and `evaluate.py` for more efficient use of computational resources.
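The first two points can be sketched as a generator that skips unwanted files and a writer that saves each proof step as soon as it is produced. Everything here is illustrative: `iter_steps`, `save_steps`, the keep-if-true meaning of `filter_file`, and the file naming are assumptions, and `pickle` is used only so the sketch runs without PyTorch (the repository itself saves `.pt` files with `torch.save()`).

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def iter_steps(
    files: Dict[str, List[str]],
    filter_file: Optional[Callable[[str], bool]] = None,
) -> Iterator[dict]:
    """Yield proof steps lazily, skipping files rejected by filter_file."""
    for filename, tactics in files.items():
        # Assumption: filter_file returns True for files that should be kept.
        if filter_file is not None and not filter_file(filename):
            continue
        for n_step, tactic_str in enumerate(tactics):
            yield {"file": filename, "n_step": n_step, "tactic_str": tactic_str}


def save_steps(steps: Iterable[dict], out_dir: Path) -> int:
    """Write each step to its own file as it arrives; return the number written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for step in steps:
        with open(out_dir / f"{count:08d}.pickle", "wb") as f:
            pickle.dump(step, f)
        count += 1
    return count


files = {"A.json": ["intros.", "auto."], "B.json": ["simpl."]}  # toy data
with tempfile.TemporaryDirectory() as tmp:
    n_saved = save_steps(iter_steps(files, lambda name: name == "A.json"), Path(tmp))
    saved = sorted(p.name for p in Path(tmp).iterdir())
print(n_saved, saved)  # 2 ['00000000.pickle', '00000001.pickle']
```

Since nothing is accumulated in memory, peak memory use stays at roughly one proof step regardless of dataset size, and a single project can be re-extracted without touching the rest.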
Some notable design modifications were made to both the new GNN encoder and the existing RL pipeline for testing purposes:
- 2-layer GNN with modular convolutions
  - Multi-headed Graph Attention (GAT) or GraphSAGE convolutions
- Used `torch_geometric.graphgym.models.encoder.IntegerFeatureEncoder` over one-hot encodings of node types
- Increased expressiveness of the attention module in the decoder
  - Added an extra layer
  - Used PReLU activations between each layer
  - Used batch normalization within each layer
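As a rough intuition for the integer-encoder choice: looking up row `t` of an embedding table gives the same vector as multiplying a one-hot vector for type `t` by that table, without ever building the one-hot vector. The sketch below checks this with plain Python lists; the function names and table sizes are made up for illustration and this is not the `IntegerFeatureEncoder` source.

```python
import random
from typing import List

Matrix = List[List[float]]


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_one_hot(node_types: List[int], weight: Matrix) -> Matrix:
    """Multiply each one-hot row vector by the (num_types x dim) weight matrix."""
    num_types, dim = len(weight), len(weight[0])
    out = []
    for t in node_types:
        v = one_hot(t, num_types)
        out.append([sum(v[i] * weight[i][j] for i in range(num_types)) for j in range(dim)])
    return out


def encode_lookup(node_types: List[int], weight: Matrix) -> Matrix:
    """Index directly into the table, as an embedding layer would."""
    return [list(weight[t]) for t in node_types]


rng = random.Random(0)
num_types, dim = 5, 3  # hypothetical vocabulary size and embedding width
weight = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_types)]
x = [4, 0, 2, 2]  # node types, as in the x field of a proof step
same = encode_one_hot(x, weight) == encode_lookup(x, weight)
print(same)  # True
```

The lookup costs one row read per node, while the one-hot product costs a pass over the whole vocabulary per node, which matters when `x` holds many nodes.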
CoqGym has many dependencies and is nontrivial to set up correctly. The following instructions detail how to obtain the CoqGym dataset and build the interaction environment natively.
- OCaml Package Manager (OPAM) is used to install OCaml and the corresponding packages.
- Lightning Memory-Mapped Database (LMDB) is used to store S-expressions in `*.json` files.
- Create an OPAM switch for OCaml 4.07.1+flambda:
opam switch create 4.07.1+flambda && eval $(opam env)
- Clone the repository:
git clone https://github.com/danjenson/CoqGym-GNN.git
- Install Coq, SerAPI and CoqHammer:
cd CoqGym-GNN && source install.sh
- Build the Coq projects (can take a while):
cd coq_projects && make && cd ..
- Set up the Python environment (see requirements.txt for version details):
curl https://pyenv.run | bash
pyenv install 3.7.1 && pyenv local 3.7.1
pip install numpy ipython lark-parser==0.6.5 lmdb==0.94 pandas==0.24.2 pexpect==4.6.0 sexpdata==0.0.3 progressbar2
pip install torch pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv torch_geometric -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
Note: Coq, SerAPI, CoqHammer, and the Coq projects in the coq_projects directory are independent software projects with their own code repositories, but please follow the instructions above to build the specific versions we need.
- Download the CoqGym dataset here
- Unzip the data and set the paths:
python unzip_data.py
Caveat: The second step sets the absolute paths in the data. You have to re-do it whenever the absolute path of the `data/` directory changes (e.g. after moving the entire repo to another directory).
Run `python eval_env.py` to check that it terminates normally without raising an error.
Now you are ready to interact with CoqGym!
Our encoder-decoder models are trained on individual proof steps rather than entire proofs. This allows us to use teacher forcing directly.
To extract proof steps from the CoqGym dataset, run `python extract_proof_steps.py` from the ASTactic directory. Note that this can take a while (8-12 hours). To speed this up, we provide an alternative multiprocessing script that parallelizes extraction across proof libraries (Coq projects): `python multiprocess_extract.py`.
The extracted proof steps are in proof_steps/. You can double-check the number of proof steps to make sure everything works as expected:
Directory | # files |
---|---|
proof_steps/train | 121,644 |
proof_steps/valid | 68,180 |
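A small script along these lines could do the check. It is a sketch that assumes each split directory is flat (one file per proof step, no subdirectories); `count_proof_steps` is a hypothetical helper, and the expected numbers are the ones from the table above.

```python
from pathlib import Path
from typing import Dict, Iterable

EXPECTED = {"train": 121644, "valid": 68180}  # counts from the table above


def count_proof_steps(root: Path, splits: Iterable[str] = ("train", "valid")) -> Dict[str, int]:
    """Count regular files directly inside each split directory (0 if missing)."""
    counts = {}
    for split in splits:
        split_dir = root / split
        if split_dir.is_dir():
            counts[split] = sum(1 for p in split_dir.iterdir() if p.is_file())
        else:
            counts[split] = 0
    return counts


counts = count_proof_steps(Path("proof_steps"))
for split, expected in EXPECTED.items():
    status = "ok" if counts[split] == expected else "MISMATCH"
    print(f"proof_steps/{split}: {counts[split]} files (expected {expected}) {status}")
```

Run it from the ASTactic directory so that the relative `proof_steps` path resolves.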
We also provide pre-extracted download tarballs for `train` and `valid` proof steps here.
To train on the proof steps in training + validation set, run the following command from the ASTactic directory:
python main.py --no_validation --exp_id <model_id> --model_type <model_type> --heads <num_heads>
Model checkpoints will be saved to runs/astactic/checkpoints/. See options.py for command line options.
CoqGym's pre-trained ASTactic model can be downloaded here.
Our pre-trained GNN models can be downloaded here.
To test a trained model on unseen proof libraries, run the following command from the ASTactic directory:
python evaluate.py ours <model_id> --path runs/<model_id>/checkpoints/model_<epoch#>.pth --filter <proof_library_name>
- Testing a single proof (e.g. `get_set_same` from `../data/StructTact/Assoc.json`):
python evaluate.py ours ours-TEST --path runs/astactic/checkpoints/model_003.pth --file ../data/StructTact/Assoc.json --proof "get_set_same"
- Testing an automated tactic X (may be "auto", "trivial", "easy", "intuition", or "hammer"):
python -u evaluate.py X X-TEST --file ../data/StructTact/Assoc.json --proof "get_set_same"
- Testing ASTactic+X:
python -u evaluate.py ours+X ours+X-TEST --path runs/astactic/checkpoints/model_003.pth --file ../data/StructTact/Assoc.json --proof "get_set_same"
Caveat: Testing is computationally expensive, but the workloads are very parallelizable. We provide the code for this in `multiprocess_test.py`.
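To illustrate why the workload parallelizes well, the sketch below builds one `evaluate.py` command per proof library, using the `--path` and `--filter` flags shown above, and defines a helper that would run them concurrently. It is not the contents of `multiprocess_test.py`; `build_eval_commands`, `run_all`, and the second library name are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List


def build_eval_commands(model_id: str, checkpoint: str, libraries: List[str]) -> List[List[str]]:
    """Build one evaluation command per proof library."""
    return [
        ["python", "evaluate.py", "ours", model_id, "--path", checkpoint, "--filter", library]
        for library in libraries
    ]


def run_all(commands: List[List[str]], max_workers: int = 4) -> List[int]:
    """Run the commands concurrently and return their exit codes, in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


commands = build_eval_commands(
    model_id="ours-TEST",
    checkpoint="runs/astactic/checkpoints/model_003.pth",
    libraries=["StructTact", "another_project"],  # second name is a placeholder
)
for cmd in commands:
    print(" ".join(cmd))
# Call run_all(commands) from the ASTactic directory to launch the evaluations.
```

Threads are enough here because each worker only waits on a separate `python` process; the heavy computation happens in those child processes.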
- Error in `source install.sh`
  - Double check requirements
- Failed to build `coqhammer`
  - `make clean` in `ASTactic/coqhammer` and remake `coqhammer`
- `make` in `coq_projects` fails
  - `make clean` and remake the entire folder
  - If that doesn't work, double check requirements and reset the directory to its fresh state
- `pip` failed to install `lmdb==0.94`
  - Try installing `lmdb==1.0`
- `EOF` error while loading `.pt` objects
  - Rebuild that project/file specifically
Data can be obtained from the original CoqGym repo here
Our pre-trained models and pre-extracted proof steps for training can be obtained here