This is Chris Beckham’s code to run diffusion operators (score matching with neural operators), to accompany the following paper:
@article{lim2023score, title={Score-based diffusion models in function space}, author={Lim, Jae Hyun and Kovachki, Nikola B and Baptista, Ricardo and Beckham, Christopher and Azizzadenesheli, Kamyar and Kossaifi, Jean and Voleti, Vikram and Song, Jiaming and Kreis, Karsten and Kautz, Jan and others}, journal={arXiv preprint arXiv:2302.07400}, year={2023} }
This repository was developed on a cluster that implements Slurm for scheduling and running experiments, and therefore it is built accommodating its various intricacies. However running jobs without it is also supported here.
Python >= 3.7 is required. The following dependencies are required:
torch
(>=1.12 should work)torchvision
(>=0.13.1 should work)scipy
omegaconf
neuraloperator
(github)- specifically commit
8a0f526
, I haven’t tested with newer versions yet so I don’t know if they break the pretrained checkpoints
- specifically commit
tqdm
To create a Conda environment with the aforementioned dependencies, you can more or less run the following:
conda create --prefix <env dir>/<my env name> python=3.9
conda activate <env dir>/<my env name>
# Install pytorch
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install omegaconf tqdm scipy opencv-python matplotlib
# Install neuraloperator
git clone https://github.com/neuraloperator/neuraloperator
cd neuraloperator
pip install -r requirements.txt
# This repo was developed under the following commit hash, though newer versions may work (yet to be tested).
git checkout 8a0f526
pip install -e .
Before we start running experiments we need to set up env.sh
, which is in the exps
directory. First run cp env.sh.bak env.sh
and then modify env.sh
so that both a save directory and a data diretory is defined:
# env.sh
# Specify directory to save results to
export SAVEDIR=...
# Specify directory where volcano data is
export DATA_DIR=...
When this is done, run source env.sh
.
The dataset used is Volcano and can be downloaded here. Download it to your dataset directory specified by $DATA_DIR
(in env.sh
) and extract it by running:
cd $DATA_DIR
tar -xvf InSAR_Volcano.tar
After this, your $DATA_DIR
should contain a folder called subset_128
such that this following directory $DATA_DIR/subset_128/
exists and has all the necessary files inside it.
For more details on the Volcano dataset, please consult the GANO paper:
@article{rahman2022generative, title={Generative adversarial neural operators}, author={Rahman, Md Ashiqur and Florez, Manuel A and Anandkumar, Anima and Ross, Zachary E and Azizzadenesheli, Kamyar}, journal={arXiv preprint arXiv:2205.03017}, year={2022} }
To run a diffusion experiment locally, simply cd into exps
and run:
RUN_LOCAL=1 bash main.sh sbgm <experiment name> <json file>
where <json file>
is one of the files in exps/json
and specifies the inputs and hyperparameters to an experiment. (All hyperparameters are documented in the Arguments
dataclass which you can find in train.py
.) The RUN_LOCAL
command simply says that the code should be run from the parent directory of exps
, rather than copy the code over to $SAVEDIR/<id>/<experiment name>
and run from there (the latter should only be done when an actual experiment is being launched from Slurm).
The above command executes an experiment and creates a results in $SAVEDIR/<slurm job id>/<experiment name>
. Various files get saved here, but the most important ones are:
config.json
: the configuration of the experiment, which is basically the hyperparameters specified in<json file>
as well as any default hyperparameters that were not specified.results.json
: a text file, where each row is a json data structure indicating per epoch metrics.model.<valid metric>.pt
: this code implements three validation metrics: the Wasserstein distance between the real/generated skew histograms (w_skew
) the same for variance (w_var
) and the sum of themw_total
. Each time a new smallest value is found for any of these three metrics, a new checkpoint is saved to the filemodel.<valid metric>.pt
.
For Slurm-enabled clusters, simply run:
sbatch launch_exp.sh sbgm <experiment name> <json file>
Note that the SBATCH flags at the header of launch_exp.sh
should be modified according to what is appropriate for your cluster environment (e.g. GPU selected, available memory, etc.). Again, the SBATCH flags provided are specific to to my own development environment.
If you are not running this on a Slurm environment, then you should also define SLURM_JOB_ID
to something that can uniquely identify your experiment. For instance, we can just use the Unix timestamp:
RUN_LOCAL=1 SLURM_JOB_ID=`date +%s` bash main.sh <experiment name> <json file>
In the results directory $SAVEDIR/<slurm job id>/<experiment name>
, the following things are saved periodically:
noise/noise_samples.{pdf,png}
: samples from the noise distribution used.noise/init_samples.{pdf,png}
: ignore this, it should be the same as the above.noise/noise_sampler_C.{pdf,png}
: the first 200 cols/rows of the computed covariance matrix.u_noised.png
: for a random image (function) from the training setu
, show the functionu + c * z
, wherec
is a coefficient fromσ_1
toσ_L
andz
is a sample from the noise distribution.samples/<epoch>.{pdf,png}
: samples generated from the model after this particular epoch of training.
Experiments are run by running a launch script that specifies (1) the type of model being trained (either sbgm
or gano
); (2) the name of the experiment (user chosen) and (3) the path to a json file which details all of the hyperparameters to be used. (To see what hyperparameters exist, please consult the Arguments
dataclass in train.py
.)
For the following commands, if you are not using Slurm, simply set SLURM_JOB_ID
to your own unique identifier and launch with bash
instead of sbatch
.
cd into exps
and run:
sbatch launch_exp.sh sbgm indep_experiment json/indep-copied.json
The main flag to be aware of here is white_noise
and should be set to true
. When this is true, the rbf*
flags are ignored.
cd into exps
and run:
sbatch launch_exp.sh sbgm rbf_experiment json/rbf-copied.json
The main flags to be aware of here are:
rbf_scale
(the smoothness parameter of the RBF kernel, larger values correspond to smoother noise)rbf_eps
(regularisation factor for the covariance matrix so the Cholesky decomposition is stable)white_noise
(should be set tofalse
)
We have a separate evaluation script which can be used to dump samples to disk, as well as evaluating the validation metrics used but on a larger set of samples. To generate samples, we run:
RUN_LOCAL=1 bash launch_eval.sh sbgm <experiment name>/<id> --mode=generate
This script will dump various pkl files out to <experiment name>/eval
which are used for subsequent scripts.
To produce histogram / Wasserstein distance plots for the generated samples, simply run:
python plot_stats.py \
--exp_name=$SAVEDIR/<experiment name>/<id> \
--stats_filename=eval/stats.<checkpoint>.pkl \
--out_filename=<out filename>.pdf
To generate samples which can be visualised, simply run:
RUN_LOCAL=1 bash launch_eval.sh sbgm <experiment name>/<id> --mode=plot
To see what additional flags are supported, check out the argparse flags in eval.py
. For example, by default the checkpoint used is model.w_total.pt
(i.e. --checkpoint=model.w_total.pt
) which is the model checkpoint corresponding to the smallest observed validation metric w_total
.
Download the pretrained checkpoint here. Extract it to your $SAVEDIR
and run:
tar -xvzf indep-checkpoint.tar.gz
The directory $SAVEDIR/tmp100_diffusion_uno-fnoblock_properconv_ngf128_tucker/3068817
should exist if you have extracted the checkpoint correctly. Then cd back into this repo then exps
then run:
RUN_LOCAL=1 bash launch_eval.sh sbgm \
"tmp100_diffusion_uno-fnoblock_properconv_ngf128_tucker/3068817" \
--checkpoint=model.w_total.pt \
--mode=generate
RUN_LOCAL=1 bash launch_eval.sh sbgm \
"tmp100_diffusion_uno-fnoblock_properconv_ngf128_tucker/3068817" \
--checkpoint=model.w_total.pt \
--mode=plot
Download the pre-trained checkpoint here. Extract it to your $SAVEDIR
and run:
tar -xvzf rbf-checkpoint.tar.gz
The directory $SAVEDIR/tmp2000_rbf_pred-noise-b_repeat/3307092
should exist if you have extracted the checkpoint correctly. Then cd back into this repo then exps
then run:
RUN_LOCAL=1 bash launch_eval.sh sbgm \
"tmp2000_rbf_pred-noise-b_repeat/3307092" \
--checkpoint=model.w_total.pt \
--mode=generate
To generate the histograms, run:
python plot_stats.py \
--exp_name=$SAVEDIR/tmp2000_rbf_pred-noise-b_repeat/3307092 \
--stats_filename=eval/stats.model.w_total.pt.pkl \
--out_filename=stats.pdf
(this will output a stats.pdf
in the current directory)
To generate images, run:
RUN_LOCAL=1 bash launch_eval.sh sbgm \
"tmp2000_rbf_pred-noise-b_repeat/3307092" \
--checkpoint=model.w_total.pt \
--mode=plot
(this will output images in the samples
subdirectory of the experiment folder)
Here we train diffusion operators on the original images downsampled to 60px. At generation time, we sample noise that is twice that resolution, effectively performing super-resolution from 60px to 120px.
Download the pre-trained checkpoint here and extract it to your $SAVEDIR
. This tarfile contains the following experiments, each varies by the amount of RBF kernel smoothness used:
tmp2000_rbf_pred-noise_eqn1c_64px/3431833 -> rbf_scale = 0.05
tmp2000_rbf_pred-noise_eqn1c_64px/3431835 -> rbf_scale = 0.1
tmp2000_rbf_pred-noise_eqn1c_64px/3431834 -> rbf_scale = 0.2
To perform super-res generation for any of them, run:
# e.g. <name> = tmp2000_rbf_pred-noise_eqn1c_64px/3431835
RUN_LOCAL=1 bash launch_eval.sh sbgm \
<experiment name>/<id> \
--checkpoint=model.w_total.pt \
--mode=superres
For example, for our best performing experiment (an RBF scale of 0.2), generate the histograms with:
python plot_stats.py \
--exp_name=$SAVEDIR/tmp2000_rbf_pred-noise_eqn1c_64px/3431834 \
--stats_filename=eval/stats_2x.model.w_total.pt.pkl \
--gt_stats_filename=eval/gt_stats_2x.pkl \
--out_filename=stats_superres.pdf
(note the specification of gt_stats_filename=eval/gt_stats_2x.pkl
, we want to compare to the original 120px ground truth images, not the 60px ones)
To generate images, run:
RUN_LOCAL=1 bash launch_eval.sh sbgm \
"tmp2000_rbf_pred-noise_eqn1c_64px/3431834" \
--checkpoint=model.w_total.pt \
--mode=plot
(this will output images in the samples
subdirectory of the experiment folder)