This repository contains an enhanced version of CosmoFlow that adds support for asynchronous data read operations. CosmoFlow is a parallel deep learning application developed for studying data generated from cosmological N-body dark matter simulations. The CosmoFlow source code is available on both GitHub and MLPerf. The programs in this repository update the CosmoFlow source code by incorporating the LBANN model and parallelizing it using Horovod. The training data files are available at NERSC.

To reduce the cost of reading the training data from files, and thus improve the end-to-end training time, this repository adds an asynchronous I/O module built on the Python multiprocessing package, which overlaps file reads with the model training computation on the GPUs.
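To illustrate the general idea (this is a hedged sketch, not the actual module in this repository), the example below uses `multiprocessing` to let a reader process prefetch batches into a bounded queue while the main process trains on earlier batches; all names (`load_batch`, `train_step`, the file list) are hypothetical.

```python
# Minimal sketch of overlapping file reads with training, assuming a
# producer/consumer design; names and file format are illustrative only.
import multiprocessing as mp

def load_batch(path):
    # Stand-in for reading one file of samples (e.g., via h5py).
    return path

def reader(files, queue):
    # Producer process: prefetch batches into a bounded queue.
    for f in files:
        queue.put(load_batch(f))
    queue.put(None)  # sentinel marking the end of the data

def train_step(batch):
    # Stand-in for one training step on the GPU.
    pass

if __name__ == "__main__":
    files = ["data_0.h5", "data_1.h5", "data_2.h5"]  # hypothetical paths
    queue = mp.Queue(maxsize=4)  # bounded buffer, analogous to --buffer_size
    p = mp.Process(target=reader, args=(files, queue))
    p.start()
    while (batch := queue.get()) is not None:
        train_step(batch)  # this compute overlaps the reader's next read
    p.join()
```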
Software requirements:
- TensorFlow > 2.0.0 (2.2.0 is recommended)
- Horovod > 0.16
- Clone the source code:
  ```
  git clone https://github.com/swblaster/tf2-cosmoflow
  ```
- Customize the run-time parameters by modifying the file paths in `./test.yaml`:
  - `frameCnt`: the number of samples in each file.
  - `numPar`: the number of parameters to be predicted.
  - `sourceDir/prj`: the top directory of the data files.
  - `subDir`: the sub-directory under `sourceDir/prj`, where the actual data files are located.
  - `splitIdx/train`: the indices of the training files.
  - `splitIdx/test`: the indices of the test files.
  Below is an example `test.yaml` file:
  ```yaml
  frameCnt: 128
  numPar: 4
  parNames: [Omega_m, sigma_8, N_spec, H_0]
  sourceDir: {
    prj: /global/cscratch1/sd/slz839/cosmoflow_c1/,
    subDir: multiScale_tryG/
  }
  splitIdx:
    test: [100, 101, 102, 103, 104, 105, 106, 107]
    train: [20, 21, 22, 23, 24, 25, 26, 27,
            30, 31, 32, 33, 34, 35, 36, 37,
            40, 41, 42, 43, 44, 45, 46, 47,
            50, 51, 52, 53, 54, 55, 56, 57,
            60, 61, 62, 63, 64, 65, 66, 67,
            70, 71, 72, 73, 74, 75, 76, 77,
            80, 81, 82, 83, 84, 85, 86, 87,
            90, 91, 92, 93, 94, 95, 96, 97]
  ```
- Command-line options (a hypothetical parsing sketch follows this list):
  - `--epochs`: the number of epochs for training.
  - `--batch_size`: the local batch size (the batch size for each process).
  - `--overlap`: (0: off / 1: on) disable/enable the I/O overlap feature.
  - `--checkpoint`: (0: off / 1: on) disable/enable checkpointing.
  - `--buffer_size`: the I/O buffer size with respect to the number of samples.
  - `--record_acc`: (0: off / 1: on) disable/enable accuracy recording.
  - `--config`: the file path for the input data configuration.
  - `--evaluate`: (0: off / 1: on) disable/enable evaluation of the trained model.
  - `--async_io`: (0: off / 1: on) disable/enable the asynchronous I/O feature.
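For reference, the options above could be parsed with `argparse` as in the hypothetical sketch below; the actual parser in `main.py` may differ. The defaults mirror the example command in the next step, and `--file_shuffle` is included because it appears in that command.

```python
# Hypothetical argparse sketch mirroring the options above; not the
# repository's actual parser.
import argparse

parser = argparse.ArgumentParser(description="tf2-cosmoflow training")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch_size", type=int, default=4)     # per process
parser.add_argument("--overlap", type=int, choices=[0, 1], default=1)
parser.add_argument("--checkpoint", type=int, choices=[0, 1], default=0)
parser.add_argument("--buffer_size", type=int, default=128)  # in samples
parser.add_argument("--file_shuffle", type=int, choices=[0, 1], default=1)
parser.add_argument("--record_acc", type=int, choices=[0, 1], default=0)
parser.add_argument("--config", type=str, default="test.yaml")
parser.add_argument("--evaluate", type=int, choices=[0, 1], default=0)
parser.add_argument("--async_io", type=int, choices=[0, 1], default=1)
args = parser.parse_args()
print(args)
```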
- Start training. Parallel training jobs can be submitted to Cori's batch queue using a script file; an example is given in `./sbatch.sh`. The file `./myjob.lsf` is an example script for running on Summit at OLCF. Below is an example `python` command that can be used in the job script:
  ```
  python3 main.py --epochs=3 \
                  --batch_size=4 \
                  --overlap=1 \
                  --checkpoint=0 \
                  --buffer_size=128 \
                  --file_shuffle=1 \
                  --record_acc=0 \
                  --config="test.yaml" \
                  --evaluate=0 \
                  --async_io=1
  ```
Publication:
- Sunwoo Lee, Qiao Kang, Kewei Wang, Jan Balewski, Alex Sim, Ankit Agrawal, Alok Choudhary, Peter Nugent, Kesheng Wu, and Wei-keng Liao. Asynchronous I/O Strategy for Large-Scale Deep Learning Applications. In the 28th International Conference on High-Performance Computing, Data, and Analytics (HiPC), December 2021.
Developers:
- Northwestern University
  - Sunwoo Lee <[email protected]>
  - Kewei Wang <[email protected]>
  - Wei-keng Liao <[email protected]>
- Lawrence Berkeley National Laboratory
  - Alex Sim <[email protected]>
  - Jan Balewski <[email protected]>
  - Peter Nugent <[email protected]>
  - John Wu <[email protected]>
For questions or comments, please contact:
- Sunwoo Lee <[email protected]>
- Wei-keng Liao <[email protected]>
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program. This project is joint work of Northwestern University and Lawrence Berkeley National Laboratory, supported by the RAPIDS Institute. This work is also supported in part by DOE awards DE-SC0014330 and DE-SC0019358.