Code for the paper "Proof Repair Infrastructure for Supervised Models: Building a Large Proof Repair Dataset".
Please cite as:

```bibtex
@inproceedings{reichel_proof_2023,
  address = {Dagstuhl, Germany},
  series = {Leibniz {International} {Proceedings} in {Informatics} ({LIPIcs})},
  title = {Proof {Repair} {Infrastructure} for {Supervised} {Models}: {Building} a {Large} {Proof} {Repair} {Dataset}},
  volume = {268},
  copyright = {All rights reserved},
  isbn = {978-3-95977-284-6},
  shorttitle = {Proof {Repair} {Infrastructure} for {Supervised} {Models}},
  url = {https://drops.dagstuhl.de/opus/volltexte/2023/18401},
  doi = {10.4230/LIPIcs.ITP.2023.26},
  urldate = {2023-07-27},
  booktitle = {14th {International} {Conference} on {Interactive} {Theorem} {Proving} ({ITP} 2023)},
  publisher = {Schloss Dagstuhl – Leibniz-Zentrum für Informatik},
  author = {Reichel, Tom and Henderson, R. Wesley and Touchet, Andrew and Gardner, Andrew and Ringer, Talia},
  editor = {Naumowicz, Adam and Thiemann, René},
  year = {2023},
  note = {ISSN: 1868-8969},
  keywords = {machine learning, proof repair, benchmarks, datasets, formal proof},
  pages = {26:1--26:20},
}
```
PRISM is a project management and data collection framework for the Coq Proof Assistant geared towards enabling creation and curation of AI/ML datasets for proof engineering tasks including proof repair. Major subcomponents include virtual environment management through opam switches, proof extraction utilities, and repair mining functions.
PRISM only supports Unix hosts and has been tested only on Ubuntu 20.04 and Ubuntu 22.04, so this guide assumes access to a Linux terminal. In addition, the following system packages or libraries must be installed:
- git
- make
- opam
- strace
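On Ubuntu (the only tested hosts), these prerequisites can typically be installed with `apt`; a sketch:

```sh
# Install system prerequisites on Ubuntu 20.04/22.04.
sudo apt-get update
sudo apt-get install -y git make opam strace
```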
- Navigate to the top-level `prism` directory (the root of the repository).
- Create a Python virtual environment (the default version is 3.11.4): `source setup_python.sh [Python 3 version (optional)]`
- Install Coq (the default version is 8.10.2): `source setup_coq.sh [Coq version number (optional)]`. Note that you may need to run `opam init` to successfully install Coq.
- Install PRISM: `make install`. Note that this requires one to have activated a virtual environment in the current shell (e.g., with `setup_python.sh` above). As a side effect, this command will install and upgrade all PRISM installation dependencies, equivalent to `pip install -Ur requirements.txt`.
- Verify the installation by running unit tests: `make test`. Note that this requires one to have activated an opam switch in the current shell (e.g., with `setup_coq.sh`).
- Install development dependencies (linters, formatters, etc.) and pre-commit hooks: `make dev`. Pre-commit hooks are applied automatically each time you try to commit changes and consist of a series of tests that must each pass before the commit is added to the repository's index. Some of these hooks may modify files (e.g., applying code styles). Any changes induced by the hooks must be manually staged via `git add` before attempting to `git commit` again. Pre-commit hooks may be skipped by adding the `--no-verify` option to `git commit`. However, skipping the hooks is generally discouraged, since it merely delays the inevitable: server-side pre-commit hooks will be applied in any branch for which a merge request has been created and will force pipeline failures (and thus prevent merges) until the hooks are satisfied. Only an administrator, project owner, or maintainer can bypass these server-side hooks when the target branch is protected.
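Taken together, a fresh setup might look like the following sketch, using the documented default versions:

```sh
# Run from the root of the prism repository.
source setup_python.sh 3.11.4  # create and activate the Python virtual environment
opam init                      # only needed if opam has never been initialized
source setup_coq.sh 8.10.2     # create and activate an opam switch with Coq
make install                   # install PRISM and its Python dependencies
make test                      # verify the installation
make dev                       # optional: development dependencies and pre-commit hooks
```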
Visual Studio Code with the following minimal extensions installed and enabled is the recommended IDE:
- Remote Development, Remote - SSH: For developing remotely on Radiance servers. Follow the official guide to set up SSH keys for password-less authentication.
- Remote - WSL: For developing locally in the Windows Subsystem for Linux.
- Python, Pylance: For developing Python packages and scripts. One should configure the extension to use `yapf` and `flake8` as the formatting and linting providers, respectively, with linting set to run upon saving a file. A file may be manually formatted by opening the command palette with F1 and selecting `Format Document`. Imports may also be sorted through the `Python Refactor: Sort Imports` command in the command palette. Static type checking should also be enabled with `mypy`.
- VSCoq: For syntax highlighting of Coq source files and possible eventual integration in Phase 2.
- GitLens: For viewing line-by-line history while editing.
- Trailing Whitespace: For visualizing and removing trailing whitespace. The extension settings should be configured to highlight trailing whitespace and trim it on save.
- autoDocstring: For automatically generating docstring stubs from type hints. One should ensure that the extension settings are configured to use the `numpy` format, use the `"""` quote style, start the docstring on a new line, and not include the function name at the start of the docstring.
- select highlight in minimap: For better visibility of selections in the minimap.
In addition, one is recommended to set the following variables in their settings JSON: `"editor.rulers": [72, 80, 88]` and `"python.defaultInterpreterPath": "path/to/prism/venv-X.Y.Z/bin/pythonX.Y"`, with the latter appropriately modified to match the location of the preferred Python interpreter.
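For example, a `settings.json` fragment along these lines would apply the recommendations above. The `yapf`, `flake8`, and `mypy` keys reflect the classic Microsoft Python extension settings and may differ in newer releases, so treat them as assumptions:

```json
{
    "editor.rulers": [72, 80, 88],
    "python.defaultInterpreterPath": "path/to/prism/venv-X.Y.Z/bin/pythonX.Y",
    "python.formatting.provider": "yapf",
    "python.linting.flake8Enabled": true,
    "python.linting.mypyEnabled": true
}
```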
Docstrings are generally expected to give a sufficient explanation of contents and functionality within PRISM source code. Users are encouraged to explore on their own. Some highlights are listed below:
- Project: The `Project` class is the central figure of PRISM, integrating operations including building, debugging, and dependency management. `Project` includes methods for inferring metadata, enumerating project files, and parsing Coq sentences from project files. In practice, one will likely use `ProjectRepo` objects to work with Coq projects.
- Project metadata: Metadata supplies configuration information for a `Project`, including build commands, dependencies, and options for line-by-line execution with `SerAPI`.
- Metadata storage: Metadata for multiple commits or projects is aggregated in a `MetadataStorage` object, which can enumerate the stored metadata and perform insertion, update, and retrieval operations. A serialized example covering the projects in the dataset can be loaded with `MetadataStorage.load`.
- SerAPI: Originally based on the `SerAPI` class of CoqGym, this class provides a wrapper around the `sertop` executable of `coq-serapi` with tested support for 7 minor versions of Coq ranging from 8.9 to 8.15.
- Command extractor: A higher-level wrapper around `SerAPI` that tracks document context and aggregates proofs into objects with abstract syntax trees (ASTs), goals, hypotheses, and fully qualified identifiers. The `CommandExtractor` provides more straightforward rollback mechanisms than `SerAPI`.
- Cache types: Caches are constructed in the first phase of repair mining and contain detailed context for each command in a project commit. The data for a project commit is stored in an aptly named `ProjectCommitData`, which captures the build environment for reproduction in addition to the comments and commands of each file. A `ProjectCommitData` can be dumped to file using `ProjectCommitData.write_coq_project` to reproduce the raw text of the commit's Coq files.
- Repair example datatypes: Repair examples are constructed in the second phase of repair mining, which operates over the cached commits. A generic `RepairInstance` class provides the core structure of a repair example as an erroneous project state (an `ErrorInstance`) paired with a repair. Two concrete subclasses, `GitRepairInstance` and `ProjectCommitDataRepairInstance`, specialize the base class to Git diffs with commit SHAs and to `ProjectCommitDataDiff`s with `ProjectCommitData`s, respectively. The former provide standalone representations of repairs, whereas the latter require `prism` to operate efficiently. Methods to construct each are provided.
- opam: A basic Python API for interacting with `opam`, enabling the creation, deletion, and cloning of switches. Note that cloned switches can only be used through the Python API (namely the `OpamSwitch` class), as `opam` is ignorant of them.
- opam switch: An abstraction of an `opam` switch, or virtual environment/sandbox, in which a project can be built in isolation. Enables the installation and removal of dependencies. Note that if the `OpamSwitch` represents a clone, then the `opam`-managed original switch must not be deleted until the clone is no longer needed.
- Switch manager: The switch manager (SwiM) provides switches upon request from a managed pool such that given constraints are satisfied. For example, if one wants a switch with a certain set of packages installed, the SwiM will find the switch in its pool that comes closest to satisfying the requirements, clone it, install the packages, and return another clone that will be entirely owned by the caller. The first clone is added to the managed pool in anticipation of accelerating subsequent requests. A least-recently-used (LRU) cache evicts stale switches that have not been requested recently. Multiple flavors of `SwitchManager` are provided, but the `AutoSwitchManager` is the easiest to use and implements the features described above; a usage sketch follows this list.

Warning: The SwiM can easily and quickly consume disk space. To clear the auto-generated switch clones, run `rm -rf /path/to/your/opam/root/*_clone_*`.

Warning: The SwiM assumes complete control of the `opam` root. It is not safe to perform normal `opam` operations or to operate two SwiMs at once on the same root directory. The `SharedSwitchManager` provides a multiprocessing-safe implementation.
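To make the highlights above concrete, the following is a minimal sketch of loading stored metadata and requesting a switch from a SwiM. The module paths and the `get_switch`/`release_switch` method names are assumptions for illustration, not verbatim API documentation; consult the docstrings for the actual signatures.

```python
# Hypothetical usage sketch; module paths and method names are assumptions.
from prism.project.metadata.storage import MetadataStorage  # path assumed
from prism.util.swim import AutoSwitchManager               # path assumed

# Load serialized metadata covering the dataset's projects (file name illustrative).
storage = MetadataStorage.load("metadata.yml")

# Request a switch satisfying a set of package constraints. The SwiM clones the
# closest switch in its pool, installs any missing packages, and hands the
# caller a clone it exclusively owns.
swim = AutoSwitchManager()
switch = swim.get_switch(["coq.8.10.2", "coq-serapi"])  # method name assumed
try:
    pass  # build a project in isolation using the returned switch
finally:
    swim.release_switch(switch)  # return the clone to the manager; name assumed
```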
Standalone scripts that provide both core and superficial utilities are housed in the `scripts` directory. Note that these scripts lie outside CI and are not necessarily maintained. Some important scripts are highlighted here:
- Cache extraction: Perform cache extraction over the commits of one or more projects (see the extraction workflow below).
- Repair mining: Perform repair mining on an extracted cache of project commit data.
- Repair example filtering: Recursively scan a directory for repair instances matching one or more tags and either print a list of found files or simply count them.
The primary purpose of PRISM is to enable mining of proof repairs from Git repositories. The following steps are recommended to perform repair mining (after following the directions in Getting Started). Repair mining is divided into two phases: cache extraction and repair mining.
In the cache extraction phase, we iterate over the commits of one or more projects and independently attempt to build them before extracting line-by-line context. The ultimate output for a given commit is a `ProjectCommitData`. The cache files are stored in a directory structure defined by the `CoqProjectBuildCacheProtocol`.
The `extract_cache.py` script is the main entrypoint for cache extraction and will handle creation of a base pool of switches for a SwiM to use. Arguments to the script allow one to limit extraction to a subset of projects, Coq versions, commits, or files. See the script's help message for more details.
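For instance, an extraction limited to a single project and Coq version might be launched as in the sketch below. Apart from `--max-proj-build-runtime`, which is discussed later in this section, the flag names are illustrative assumptions; the help message lists the real options.

```sh
# Illustrative invocation; run `python scripts/extract_cache.py --help` for the real flags.
# `--project` and `--coq-version` are hypothetical names for the documented filters.
python scripts/extract_cache.py --project my-project --coq-version 8.10.2 \
    --max-proj-build-runtime 3600
```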
Builds are attempted for supported Coq versions in reverse order. Extraction of a large number of projects or commits can take a very long time, so it is recommended that one use a terminal multiplexer such as `tmux`.
Progress can be monitored in a separate shell with the `monitor-caching.py` script, which will show the number of successful and failed extractions. Do not be alarmed by a large number of failures (shown as errors), as some commits simply cannot be extracted.
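A long-running extraction session might therefore be structured as follows (script paths are assumed to be relative to the repository root):

```sh
tmux new-session -s extraction       # keep the extraction alive across disconnects
python scripts/extract_cache.py ...  # launch extraction with your chosen filters
# Detach with Ctrl-b d, then check progress from any shell:
python scripts/monitor-caching.py
```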
Failures are grouped into three categories: build errors, cache errors, and miscellaneous errors. These errors should be treated as symptoms; determining the underlying cause of a failure is a manual procedure. Build errors indicate that something went wrong during the build process and may point to either a buggy project commit or incorrect metadata that could not be repaired by built-in inference mechanisms. Cache errors imply a bug in the extraction code itself, arising either from a flawed assertion/assumption or some other logical error. Cache errors are the only category indicating something unambiguously wrong, but each such error generally impacts only a subset of commits. Miscellaneous errors are captured exceptions that do not fall within the previous two categories. Failure to obtain a switch from the SwiM due to unsatisfiable dependencies is a common miscellaneous error, which may or may not indicate flawed metadata.
If extraction of a project commit encountered an error, one can open the error log in the cache to obtain a stack trace, which may also identify the Coq file and line at which the error occurred. Running extraction limited to just the project, commit, Coq version, and file enables one to use a debugger to more precisely identify the cause and possible fixes.
Extraction contains multiprocessing components across projects and files, so be sure that the number of workers chosen is appropriate for your available resources. In addition, be prepared for significant consumption of disk space by the cloned switches created by the SwiM. Finally, note that even though `--max-proj-build-runtime X` will abort an extraction if the project takes more than `X` seconds to build, the timeout on the actual extraction of commands via line-by-line execution is separate and may last longer.
In the repair mining phase, we consider pairs of extracted commits and mine repairs from their changes. The ultimate output is a collection of pairs of files, each pair storing a serialized `GitRepairInstance` and `ProjectCommitDataRepairInstance`, respectively.
The `mine_repair_instances.py` script is the main entrypoint for repair mining and takes a configuration file as its only argument. A sample configuration is provided by `mine_repair_instances`. Note that one should adjust `cache_root`, `repair_instance_db_directory`, and `metadata_storage_file` to match the working directory from which the repair mining script is called.
To prevent successive repair mining runs from clobbering one another, especially when there have been changes to `prism`, it is recommended that one first generate a configuration with `produce_config.py`, which will generate a base configuration file with the current commit as the name of the root directory. The configuration file simply stores a JSON dictionary containing keyword arguments to the `repair_mining_loop`, where one can find a description of each.
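As an illustration, a generated configuration might resemble the sketch below. Only `cache_root`, `repair_instance_db_directory`, `metadata_storage_file`, and `fast` are named in this guide; the remaining key and all values are hypothetical placeholders for keyword arguments of `repair_mining_loop`.

```json
{
    "cache_root": "/path/to/workdir/caches",
    "repair_instance_db_directory": "/path/to/workdir/repair-instance-db",
    "metadata_storage_file": "/path/to/workdir/metadata.yml",
    "fast": true,
    "max_workers": 8
}
```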
Repair mining involves multiprocessing components and loads large files into memory, so be sure that the number of workers chosen is appropriate for your available resources. Choosing too many workers may result in difficult-to-debug out-of-memory errors. Note that `fast` is enabled by default in generated configurations, which strips the loaded caches (and the resulting `ProjectCommitDataRepairInstance`s) of ASTs and other data that is expensive to process and store.
One may wish to perform repair mining on (disjoint) caches on one or more computers. Repair instance databases produced in this manner can be merged after the fact using the `merge_repair_instance_dbs.py` script, which will deduplicate any repairs in their intersection.
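For example (the argument structure is an assumption; see the script's help message):

```sh
# Hypothetical invocation: merge two per-machine databases, deduplicating shared repairs.
python scripts/merge_repair_instance_dbs.py repair-db-a repair-db-b merged-repair-db
```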
Some basic summary statistics can be collected using the `filter_repair_examples.py` script. For example, one may count the total number of repair examples in a database with a given (set of) tag(s). Tags of particular importance include `repair:proof` and `repair:specification`. The former indicates a repair to the body of a proof, whereas the latter indicates a repair to a lemma/conjecture itself. Cataloguing repairs based on the type of Vernacular command (e.g., `VernacStartTheoremProof`) can be done using tags of the form `repair:[CommandTypeName]`, e.g., `repair:VernacDefinition`.
This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Agreement HR00112290067.
The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
We also want to acknowledge and thank Dr. Talia Ringer and Thomas (Tom) Reichel of the
University of Illinois at Urbana-Champaign for their valuable insights and contributions
to this work.
Talia championed the ITP 2023 paper and the overall importance of the endeavor, and
Tom implemented solutions to multiple critical issues including but not
limited to switch cloning, sequence alignment with `numba`, and dependency inference.
Without their help, this work would be diminished.