Distributed Batched Scoring Using Azure Batch AI

This recipe shows how to run distributed batched scoring using Azure Batch AI, and benchmarks how performance scales with the size of the GPU cluster.

Image Data, Codebase and Models

The image dataset for scoring is provided by CaiCloud and VIP.com. The dataset is hosted in the Azure Storage account octopustext under the Blob Container batchaisample. It contains 366,111 JPEG images with a total size of approximately 45 GB.

The main codebase and pretrained model files are also provided by CaiCloud, and are hosted in the same storage account under the File Share abc. The codebase contains all dependencies required to run batch scoring for the image classification and cloth recognition tasks. Pretrained VGG16 and Inception V3 models have also been uploaded to the File Share under the output directory.

The Storage account octopustext is located in the East US data center. Please use Azure Storage Explorer or the Azure Portal to view the detailed directory structure of the Blob Container batchaisample and the File Share abc.

Batch Scoring Job Script

The main script used for the scoring job, dist_inference.py, is located in the root directory of File Share abc. To view it, please download it from Azure Storage.

The input argument '--inference' specifies the scoring task: 'cloth_recognition' or 'classification'. The main idea is to shard the whole dataset into partitions based on the total number of workers; each worker then processes its assigned partition of images independently. There is no communication between workers, as the sketch below illustrates.
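A minimal sketch of this sharding scheme (the --data_dir argument, the mount path, and the glob pattern are illustrative assumptions, not the actual contents of dist_inference.py):

import argparse
import glob
import os

parser = argparse.ArgumentParser()
parser.add_argument('--inference', choices=['cloth_recognition', 'classification'],
                    required=True)
parser.add_argument('--data_dir', default='/mnt/data')  # hypothetical mount point
args = parser.parse_args()

# Under OpenMPI, every launched process can read its rank and the world size
# from the environment variables that mpirun sets.
worker_id = int(os.environ.get('OMPI_COMM_WORLD_RANK', '0'))
num_workers = int(os.environ.get('OMPI_COMM_WORLD_SIZE', '1'))

# Sort so every worker sees the same global ordering, then take every
# num_workers-th image starting at this worker's rank: the partitions are
# disjoint and together cover the whole dataset.
all_images = sorted(glob.glob(os.path.join(args.data_dir, '*.jpg')))
my_partition = all_images[worker_id::num_workers]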

When the script completes, it outputs the total number of images it processed and the elapsed time, for example:

Worker 0 Processed 3124 images, took 0:10:43.983157

Please feel free to edit/optimize the logic of the script if needed.

Prerequisites

Please place a filled-in configuration.json, based on the template, in the same directory. It should include the Azure Batch AI authentication information and the credentials of the Storage Account octopustext.
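For orientation, a minimal sketch of loading that configuration in Python (the field names here follow the common Batch AI recipe template and are assumptions; adjust them to match the actual template in this repository):

import json

with open('configuration.json') as f:
    cfg = json.load(f)

subscription_id = cfg['subscription_id']
aad_client_id = cfg['aad_client_id']    # service principal used for Batch AI auth
aad_secret = cfg['aad_secret']
aad_tenant = cfg['aad_tenant']
storage_account_name = cfg['storage_account_name']  # octopustext
storage_account_key = cfg['storage_account_key']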

Install the Azure Batch AI management client using the following command:

pip install azure-mgmt-batchai

Since we may need to utilize APIs from other Azure products (e.g., Azure Storage, credentials), it is also required to install the full Azure Python SDK package:

pip install azure

Install Jupyter Notebook from https://jupyter.org/ or run

python -m pip install jupyter

Run the Batch Scoring Recipe

This Jupyter Notebook file contains information on how to run the Batch Scoring job on GPU nodes with Batch AI. You will be able to tune variables, including node_count and vm_size, to obtain different benchmark results; a sketch of the cluster-creation step is shown below.
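A hedged sketch of creating a cluster with these two knobs via the Batch AI Python SDK, reusing the values loaded from configuration.json earlier (the model and method names reflect the pre-workspace azure-mgmt-batchai releases, and the resource group, cluster name, and admin credentials are placeholders; the exact SDK surface may differ in your version):

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.batchai import BatchAIManagementClient
import azure.mgmt.batchai.models as models

node_count = 4              # tune: number of nodes in the cluster
vm_size = 'STANDARD_NC6'    # tune: e.g. NC series for K80, NCv2 for P100

creds = ServicePrincipalCredentials(client_id=aad_client_id,
                                    secret=aad_secret,
                                    tenant=aad_tenant)
client = BatchAIManagementClient(credentials=creds,
                                 subscription_id=subscription_id)

params = models.ClusterCreateParameters(
    location='eastus',      # same region as the octopustext storage account
    vm_size=vm_size,
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=node_count)),
    user_account_settings=models.UserAccountSettings(
        admin_user_name='demo_user',
        admin_user_password='demo_password'))

# create() returns a poller; result() blocks until provisioning finishes.
cluster = client.clusters.create('my_resource_group', 'nc6_cluster', params).result()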

Note that since there is no communication between workers, a Parameter Server is not required in this case. Therefore, we use customToolkitSettings in the Batch AI job definition (instead of TensorFlowSettings) and use OpenMPI to launch and monitor all workers more efficiently. The OpenMPI binary is installed in the container using a JobPreparation task. A sketch of such a job definition follows.
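Continuing the previous sketch, a hedged example of the job definition just described (the mount path, stdout prefix, worker-count arithmetic, and apt package name are illustrative assumptions):

gpus_per_node = 1                        # e.g. 1 GPU per STANDARD_NC6 node
total_workers = node_count * gpus_per_node

job_params = models.JobCreateParameters(
    location='eastus',
    cluster=models.ResourceId(id=cluster.id),
    node_count=node_count,
    # Hypothetical path prefix for the job's stdout/stderr logs.
    std_out_err_path_prefix='$AZ_BATCHAI_MOUNT_ROOT/afs',
    # Install the OpenMPI binary inside the container before the job runs.
    job_preparation=models.JobPreparation(
        command_line='apt-get update && apt-get install -y openmpi-bin'),
    # No parameter server: launch the independent workers with mpirun.
    custom_toolkit_settings=models.CustomToolkitSettings(
        command_line='mpirun -np {} --allow-run-as-root '
                     'python $AZ_BATCHAI_MOUNT_ROOT/afs/dist_inference.py '
                     '--inference cloth_recognition'.format(total_workers)))

client.jobs.create('my_resource_group', 'scoring_job', job_params).result()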

Benchmark Results

The table below shows the elapsed time to label 100k images for the 'cloth_recognition' (VGG-16) task:

Number of GPUs    1          8         16        32
K80               741 mins   99 mins   49 mins   25 mins
P100              255 mins   32 mins   19 mins   10 mins

Quasi-linear scaling can be observed as the number of GPUs grows.

The benchmark for the 'classification' (Inception-V3) task has not been run yet; the test code needs to be optimized to achieve higher GPU efficiency.

References

  • To transfer large amounts of data between a local device and Azure Storage, please use AzCopy or Blobxfer instead of the Portal/Storage Explorer.

  • A detailed reference for the Batch AI Python SDK can be found here.

  • If you prefer to use Azure CLI 2.0 instead of the Python SDK, please see this article for instructions.
