DGXSYSTEM=DGX1: 2-CPU, 8-GPU server with 20 physical cores per CPU and Hyper-Threading enabled (40 logical cores per CPU).
# Build docker image
docker build --pull -t mlperf-nvidia:image_classification .
export NEXP=1
export DATADIR=<path-to-location-of-ImageNet-dataset>
export LOGDIR=<path-to-where-you-want-to-store-logfiles>
export DGXSYSTEM=DGX1
./run.sub
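Before calling ./run.sub it can help to confirm that the exported variables are usable. The following is a minimal sketch, not part of the repository: `check_env` is a hypothetical helper, and the defaults it applies (DGXSYSTEM=DGX1, NEXP=1) simply mirror the exports shown above.

```shell
# Hypothetical helper (not shipped with this repository): sanity-check the
# environment variables used by the launch commands.
check_env() {
    # DATADIR must be an existing directory that holds the dataset.
    if [ -z "${DATADIR:-}" ] || [ ! -d "$DATADIR" ]; then
        echo "DATADIR is unset or not a directory" >&2
        return 1
    fi
    # LOGDIR must be set; create it if it does not exist yet.
    if [ -z "${LOGDIR:-}" ]; then
        echo "LOGDIR is unset" >&2
        return 1
    fi
    mkdir -p "$LOGDIR" || return 1
    # Assumption: fall back to DGX1, the value exported in the steps above.
    : "${DGXSYSTEM:=DGX1}"
    echo "DATADIR=$DATADIR LOGDIR=$LOGDIR DGXSYSTEM=$DGXSYSTEM NEXP=${NEXP:-1}"
}
```

A possible use is `check_env && ./run.sub`, so that the launch is skipped when a path is wrong.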
This problem uses the ResNet-50 CNN to do image classification.
Download the dataset manually by following the instructions on the ImageNet website. We use the original ImageNet dataset, packed into an MXNet RecordIO database. The images are neither resized nor normalized; no preprocessing is applied to the raw ImageNet JPEGs.
For further instructions, see https://github.com/NVIDIA/DeepLearningExamples/blob/master/MxNet/Classification/RN50v1.5/README.md#prepare-dataset .
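Once the dataset has been packed, a quick check that the data directory actually contains RecordIO files can catch a wrong DATADIR early. This is a sketch under assumptions: `has_recordio` is a hypothetical helper, and it assumes the packed files carry the conventional `.rec` extension and sit directly in the directory; adjust it if your layout differs.

```shell
# Hypothetical helper: succeed if the directory holds at least one .rec file.
# Assumption: RecordIO files end in .rec and live at the top level of the dir.
has_recordio() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # find prints matching regular files; an empty result means none exist.
    [ -n "$(find "$dir" -maxdepth 1 -type f -name '*.rec')" ]
}
```

For example, `has_recordio "$DATADIR" || echo "no RecordIO files found in $DATADIR"`.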
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1 single-node submission are in the config_DGX1.sh script.
Steps required to launch single-node training on NVIDIA DGX-1:
docker build --pull -t mlperf-nvidia:image_classification .
DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1 ./run.sub
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-2 single-node submission are in the config_DGX2.sh script.
Steps required to launch single-node training on NVIDIA DGX-2:
docker build --pull -t mlperf-nvidia:image_classification .
DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX2 ./run.sub
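The single-node launches for DGX-1 and DGX-2 differ only in the DGXSYSTEM value. The sketch below builds the launch line for a given system and refuses to do so when the matching config script is absent. `single_node_cmd` is a hypothetical helper, and the `config_<DGXSYSTEM>.sh` naming is inferred from the file names mentioned above, not from run.sub itself.

```shell
# Hypothetical helper: print the single-node launch command for a system.
# Assumption: the config script is named config_<DGXSYSTEM>.sh and lives in
# the current directory, as the file names in this README suggest.
single_node_cmd() {
    system="${1:-DGX1}"
    config="config_${system}.sh"
    if [ ! -f "$config" ]; then
        echo "no $config in $(pwd)" >&2
        return 1
    fi
    # Only print the command; nothing is launched here.
    printf 'DATADIR=%s LOGDIR=%s DGXSYSTEM=%s ./run.sub\n' \
        "$DATADIR" "$LOGDIR" "$system"
}
```

Printing the command instead of running it keeps the helper safe to call while checking paths, for example `single_node_cmd DGX2`.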
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1 multi-node submission are in the config_DGX1_multi.sh script.
Steps required to launch multi-node training on NVIDIA DGX-1:
- Build the Docker image and push it to a Docker registry
docker build --pull -t <docker/registry>/mlperf-nvidia:image_classification .
docker push <docker/registry>/mlperf-nvidia:image_classification
- Launch the training
source config_DGX1_multi.sh && CONT="<docker/registry>/mlperf-nvidia:image_classification" DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-2 multi-node submission are in the config_DGX2_multi.sh script.
Steps required to launch multi-node training on NVIDIA DGX-2:
- Build the Docker image and push it to a Docker registry
docker build --pull -t <docker/registry>/mlperf-nvidia:image_classification .
docker push <docker/registry>/mlperf-nvidia:image_classification
- Launch the training
source config_DGX2_multi.sh && CONT="<docker/registry>/mlperf-nvidia:image_classification" DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX2_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub
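The multi-node launch lines take their Slurm parameters (DGXNNODES, WALLTIME, DGXNGPU) from the sourced config script. To inspect the values before submitting, a dry-run along these lines may be useful. `multi_node_cmd` is a hypothetical helper; it assumes the config script can be sourced on its own and sets those three variables, as the launch commands above imply.

```shell
# Hypothetical helper: print the sbatch command for a multi-node system
# (e.g. DGX1_multi or DGX2_multi) without submitting anything.
multi_node_cmd() {
    system="$1"
    cont="$2"   # container image reference, e.g. <docker/registry>/mlperf-nvidia:image_classification
    config="config_${system}.sh"
    if [ ! -f "$config" ]; then
        echo "missing $config" >&2
        return 1
    fi
    (
        # Source in a subshell so the caller's environment stays untouched.
        . "./$config"
        # ${VAR:?} aborts the subshell if the config did not set the variable.
        echo "CONT=$cont DATADIR=$DATADIR LOGDIR=$LOGDIR DGXSYSTEM=$system" \
             "sbatch -N ${DGXNNODES:?} -t ${WALLTIME:?} --ntasks-per-node ${DGXNGPU:?} run.sub"
    )
}
```

Running `multi_node_cmd DGX2_multi "$CONT"` shows the node count and wall time that would be requested, which is easier to review than the expanded one-liner.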