The CVAE data is read from a folder containing two TFRecords files, train.tfrecords and val.tfrecords.
The data should be written as a serialized array of boolean values.
The folder path and the shape of each TFRecord example must be specified in configs/params.yaml.
The data is then reshaped and cropped (upper-left corner) to size input_shape before being fed to the model.
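The serialize, reshape, and crop steps described above can be sketched with NumPy alone. This is only an illustration under stated assumptions, not the repository's input pipeline: the real data lives in TFRecords and is read with TensorFlow, the one-byte-per-boolean layout is assumed, and the helper names and both shapes below are hypothetical.

```python
import numpy as np


def serialize_example(mask: np.ndarray) -> bytes:
    """Serialize a boolean array as raw bytes (assumed: one byte per value, C order)."""
    return np.ascontiguousarray(mask, dtype=np.bool_).tobytes()


def decode_and_crop(raw: bytes, record_shape, input_shape) -> np.ndarray:
    """Rebuild the stored array, then keep the upper-left input_shape window."""
    full = np.frombuffer(raw, dtype=np.bool_).reshape(record_shape)
    # One slice per axis, each starting at index 0 (the upper-left corner).
    window = tuple(slice(0, size) for size in input_shape)
    return full[window]


record_shape = (4, 6)  # hypothetical shape of each stored example
input_shape = (3, 4)   # hypothetical model input size

mask = np.arange(24).reshape(record_shape) % 2 == 0
raw = serialize_example(mask)
cropped = decode_and_crop(raw, record_shape, input_shape)
print(cropped.shape)
```

In the actual workflow, the bytes would be wrapped in a TFRecord example and both shapes would come from configs/params.yaml rather than being hard-coded.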
To train on the Cerebras System (from ANL-shared/ directory, on the host machine):
cd ANL-shared/cvae/tf
- Train command with an orchestrator like Slurm:
csrun_wse python run.py --mode train -p configs/params.yaml --model_dir $OUTPUT_DIRECTORY --cs_ip X.X.X.X
*Note: specifying --cs_ip instructs the run.py script to run on CS hardware.
To train on CPU/GPU (from ANL-shared/ directory):
cd ANL-shared/cvae/tf
python run.py --mode train --model_dir $OUTPUT_DIRECTORY -p configs/params.yaml
Where:
- OUTPUT_DIRECTORY = Path to save trained models
- Within the ANL environment, all files required for training must be in the /data/... root directory path so that they will be accessible inside the container.
To run evaluation or prediction on any device, follow the training instructions for that device, but pass eval as the --mode.
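The training and evaluation commands differ only in the --mode value and, on the Cerebras System, in the csrun_wse prefix and the --cs_ip argument. A minimal sketch of a helper that assembles either command is given here; build_run_command is a hypothetical name, not part of run.py, and the sketch only reproduces the flags shown in this README.

```python
import shlex


def build_run_command(mode, model_dir, params="configs/params.yaml", cs_ip=None):
    """Assemble the run.py invocation from this README as an argument list."""
    cmd = ["python", "run.py", "--mode", mode, "-p", params, "--model_dir", model_dir]
    if cs_ip is not None:
        # On the Cerebras System the README prefixes csrun_wse and passes --cs_ip.
        cmd = ["csrun_wse"] + cmd + ["--cs_ip", cs_ip]
    return cmd


# CPU/GPU training run; "model_dir" stands in for $OUTPUT_DIRECTORY.
cpu_cmd = build_run_command("train", "model_dir")
# Evaluation on the Cerebras System; the IP address is a placeholder.
cs_cmd = build_run_command("eval", "model_dir", cs_ip="X.X.X.X")

print(shlex.join(cpu_cmd))
print(shlex.join(cs_cmd))
```

Returning a list rather than a single string means the result can be passed directly to subprocess.run without shell quoting concerns.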
To compile the model without running it, use the --validate_only or --compile_only flag when running inside the cbcore container:
- --validate_only: Compile the model up to kernel matching
- --compile_only: Compile the model completely, generating compiled executables