0. Configurations and benchmark running

DGXSYSTEM=DGX1: a 2-CPU, 8-GPU server with 20 cores per CPU and Hyper-Threading enabled (40 logical cores per CPU).

# Build docker image
docker build --pull -t mlperf-nvidia:image_classification .

export NEXP=1
export DATADIR=<path-to-location-of-ImageNet-dataset> 
export LOGDIR=<path-to-where-you-want-to-store-logfiles>
export DGXSYSTEM=DGX1

./run.sub
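
Because run.sub depends on the variables exported above, it can help to check them before launching. The following is a minimal sketch, not part of the benchmark scripts: check_env is a hypothetical helper, and creating LOGDIR in advance is an assumption (run.sub may already do this itself).

```shell
# Hypothetical pre-flight helper; not part of the benchmark scripts.
# Returns non-zero if a required variable is empty or DATADIR does not exist.
check_env() {
    if [ -z "${DATADIR:-}" ]; then
        echo "error: DATADIR is not set" >&2
        return 1
    fi
    if [ -z "${LOGDIR:-}" ]; then
        echo "error: LOGDIR is not set" >&2
        return 1
    fi
    if [ -z "${DGXSYSTEM:-}" ]; then
        echo "error: DGXSYSTEM is not set" >&2
        return 1
    fi
    if [ ! -d "${DATADIR}" ]; then
        echo "error: DATADIR=${DATADIR} is not a directory" >&2
        return 1
    fi
    # Assumption: creating the log directory up front is harmless.
    mkdir -p "${LOGDIR}"
}
```

With the function defined in the current shell, the launch could be guarded as `check_env && ./run.sub`.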

1. Problem

This benchmark uses the ResNet-50 CNN for image classification on the ImageNet dataset.

Requirements

nvidia-docker
MXNet 18.11-py3 NGC container

2. Directions

Steps to download and verify data

Download the dataset manually by following the instructions on the ImageNet website. We use the original ImageNet dataset, packed into an MXNet RecordIO database. The images are neither resized nor normalized; no preprocessing was performed on the raw ImageNet JPEGs.

For further instructions, see https://github.com/NVIDIA/DeepLearningExamples/blob/master/MxNet/Classification/RN50v1.5/README.md#prepare-dataset .
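
As a rough sanity check after packing, one could confirm that the expected RecordIO files exist and are non-empty. This is only a sketch: check_recordio is a hypothetical helper, and the prefixes "train" and "val" in the usage note are assumptions about the file names your packing step produced. It does not validate the file contents.

```shell
# Hypothetical helper: report RecordIO files that are missing or empty.
# Usage: check_recordio DIR PREFIX...
check_recordio() {
    dir=$1
    shift
    missing=0
    for prefix in "$@"; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "${dir}/${prefix}.rec" ]; then
            echo "missing or empty: ${dir}/${prefix}.rec" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

For example, `check_recordio "${DATADIR}" train val` returns non-zero if either file is absent, assuming those are the names in use.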

Steps to launch training

NVIDIA DGX-1 (single node)

Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1 single node submission are in the config_DGX1.sh script.

Steps required to launch single node training on NVIDIA DGX-1:

docker build --pull -t mlperf-nvidia:image_classification .
DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1 ./run.sub

NVIDIA DGX-2 (single node)

Launch configuration and system-specific hyperparameters for the NVIDIA DGX-2 single node submission are in the config_DGX2.sh script.

Steps required to launch single node training on NVIDIA DGX-2:

docker build --pull -t mlperf-nvidia:image_classification .
DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX2 ./run.sub

NVIDIA DGX-1 (multi node)

Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1 multi node submission are in the config_DGX1_multi.sh script.

Steps required to launch multi node training on NVIDIA DGX-1:

Build the Docker image and push it to a Docker registry

docker build --pull -t <docker/registry>/mlperf-nvidia:image_classification .
docker push <docker/registry>/mlperf-nvidia:image_classification

Launch the training

source config_DGX1_multi.sh && CONT="<docker/registry>/mlperf-nvidia:image_classification" DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub

NVIDIA DGX-2 (multi node)

Launch configuration and system-specific hyperparameters for the NVIDIA DGX-2 multi node submission are in the config_DGX2_multi.sh script.

Steps required to launch multi node training on NVIDIA DGX-2:

Build the Docker image and push it to a Docker registry

docker build --pull -t <docker/registry>/mlperf-nvidia:image_classification .
docker push <docker/registry>/mlperf-nvidia:image_classification

Launch the training

source config_DGX2_multi.sh && CONT="<docker/registry>/mlperf-nvidia:image_classification" DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX2_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub
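
The multi-node launch commands take their sbatch options from variables that the sourced config script is expected to define (DGXNNODES, WALLTIME, DGXNGPU). To see what would be submitted without actually queuing a job, a dry-run helper along these lines may be useful. print_sbatch_cmd is a hypothetical name, and the values in the usage note are illustrative only, not the contents of any config file.

```shell
# Hypothetical dry-run helper: print the sbatch command instead of submitting it.
# Expects DGXNNODES, WALLTIME and DGXNGPU to be set, e.g. by sourcing
# config_DGX1_multi.sh or config_DGX2_multi.sh first.
print_sbatch_cmd() {
    if [ -z "${DGXNNODES:-}" ] || [ -z "${WALLTIME:-}" ] || [ -z "${DGXNGPU:-}" ]; then
        echo "error: source the multi-node config script first" >&2
        return 1
    fi
    printf 'sbatch -N %s -t %s --ntasks-per-node %s run.sub\n' \
        "${DGXNNODES}" "${WALLTIME}" "${DGXNGPU}"
}
```

With illustrative values DGXNNODES=2, WALLTIME=01:00:00 and DGXNGPU=8, the function prints `sbatch -N 2 -t 01:00:00 --ntasks-per-node 8 run.sub`.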
