In this tutorial, we are going to build CNN model inference software. This tutorial is based on Menoh's original tutorial.
The script loads data/VGG16.onnx, takes input images, and outputs the classification results.
To get the names of the ONNX model's variables, please refer to the Menoh tutorial.
VGG16 has one input and one output. Following that tutorial, we can confirm that the input name is 140326425860192 (input of 0:Conv) and the output name is 140326200803680 (output of 39:Softmax).
Some of us are also interested in the feature vector of the input image. So, in addition, we take the output of 32:FC (fc6, the first FC layer after the convolutional layers), named 140326200777584.
We define name aliases for convenience:
CONV1_1_IN_NAME = '140326425860192'.freeze
FC6_OUT_NAME = '140326200777584'.freeze
SOFTMAX_OUT_NAME = '140326200803680'.freeze
To build the model, we first load the model data from the ONNX file:
onnx_obj = Menoh::Menoh.new './data/VGG16.onnx'
Now let's build the model.
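# data shape of input images
input_shape = {
  channel_num: 3,
  width: 224,
  height: 224
}
# options for building the model
model_opt = {
  backend: 'mkldnn',
  input_layers: [
    {
      name: CONV1_1_IN_NAME,
      # NCHW order; image_list is the array of input image paths used in the preprocessing step
      dims: [image_list.length, input_shape[:channel_num], input_shape[:height], input_shape[:width]]
    }
  ],
  output_layers: [FC6_OUT_NAME, SOFTMAX_OUT_NAME]
}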
# make model for inference under 'model_opt'
model = onnx_obj.make_model model_opt
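Because the input dims follow NCHW order, the flattened input data has to contain N x C x H x W values. The following is a minimal sketch in plain Ruby (no Menoh calls) of how one might sanity-check that count before running the model. `expected_length` is a hypothetical helper, and the batch size of 2 is an assumed example value, not something fixed by the tutorial.

```ruby
# Hypothetical helper: number of values a flattened NCHW input must hold.
def expected_length(dims)
  dims.inject(1) { |product, dim| product * dim }
end

input_shape = { channel_num: 3, width: 224, height: 224 }
batch_size = 2 # assumed number of input images

# NCHW order: batch, channels, height, width
dims = [batch_size, input_shape[:channel_num], input_shape[:height], input_shape[:width]]

puts expected_length(dims) # 2 * 3 * 224 * 224 values
```

Comparing this number with the length of the flattened data array is a cheap way to catch a wrong resize or a missing channel early.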
Before running inference, the input images must be preprocessed. data/VGG16.onnx takes 3-channel images of size 224 x 224, but input images are not always that size, so we use RMagick's resize_to_fill method to resize them.
VGG16.onnx's input layer 140326425860192 takes images in NCHW format (N x Channels x Height x Width), but RMagick's image array holds the channel values interleaved pixel by pixel. So next we call the export_pixels method for each of the channels ['B', 'G', 'R'], then flatten the result.
image_list = [
  # paths to the input image files
]
image_set = [
  {
    name: CONV1_1_IN_NAME,
    data: image_list.map do |image_filepath|
      image = Magick::Image.read(image_filepath).first
      image = image.resize_to_fill(input_shape[:width], input_shape[:height])
      'BGR'.split('').map do |color|
        image.export_pixels(0, 0, image.columns, image.rows, color).map { |pix| pix / 256 }
      end.flatten
    end.flatten
  }
]
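To make the layout change concrete, here is a small sketch in plain Ruby that mimics the per-channel export on a toy 2 x 2 image. It does not use RMagick: `channel_plane` is a hypothetical stand-in for exporting one channel, and the pixel values are made up so that each channel is easy to recognize.

```ruby
# A tiny 2x2 "image" stored interleaved: R, G, B for each pixel in turn.
# Made-up values: 1x = red, 2x = green, 3x = blue.
interleaved = [
  10, 20, 30,   11, 21, 31,
  12, 22, 32,   13, 23, 33
]

# Hypothetical stand-in for exporting a single channel plane.
def channel_plane(pixels, color)
  offset = 'RGB'.index(color)
  pixels.each_slice(3).map { |rgb| rgb[offset] }
end

# Planar CHW layout with the channels in B, G, R order
planar = 'BGR'.chars.flat_map { |color| channel_plane(interleaved, color) }
p planar
```

All blue values come first, then all green, then all red, which is the channel-major layout that an NCHW input expects for a single image.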
In this case, data/VGG16.onnx expects pixel values in the range [0, 255], whereas RMagick's export_pixels returns values in the range [0, 65535]. So we have to scale the values by dividing by 256.
Sometimes a model instead takes values scaled to a range such as [0, 1]. In that case, we can scale the values here:
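image_set = [
  {
    name: CONV1_1_IN_NAME,
    data: image_list.map do |image_filepath|
      image = Magick::Image.read(image_filepath).first
      image = image.resize_to_fill(input_shape[:width], input_shape[:height])
      'BGR'.split('').map do |color|
        # divide by a Float so the result is not truncated to 0 by integer division
        image.export_pixels(0, 0, image.columns, image.rows, color).map { |pix| pix / 65536.0 }
      end.flatten
    end.flatten
  }
]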
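When choosing the divisor, keep in mind that Ruby's `/` truncates when both operands are Integers. The sketch below uses a few sample values, assuming 16-bit channel values in 0..65535, to compare the scaling variants. The sample values are illustrative only.

```ruby
# Sample 16-bit channel values (assumed range 0..65535)
samples = [0, 256, 32_768, 65_535]

# Integer division by 256 maps 0..65535 onto 0..255
to_8bit = samples.map { |pix| pix / 256 }

# Integer division by 65536 truncates every sample to 0
truncated = samples.map { |pix| pix / 65_536 }

# A Float divisor keeps the fractional part, giving values in [0, 1)
to_unit = samples.map { |pix| pix / 65_536.0 }

p to_8bit
p truncated
p to_unit
```

So integer division is fine for the [0, 255] case, while scaling to [0, 1] needs a Float divisor (or `to_f`) to keep any information.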
Now we can run the inference.
# execute inference
inference_results = model.run image_set
inference_results is an array containing one hash of results for each of the output_layers, so you can get each value as follows.
fc6_out = inference_results.find { |x| x[:name] == FC6_OUT_NAME }
softmax_out = inference_results.find { |x| x[:name] == SOFTMAX_OUT_NAME }
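As an illustration of what one might do with the softmax output, the sketch below picks the highest-scoring classes from mock results. It does not call Menoh. The `:data` key, the per-image nesting of the scores, and the `top_k` helper are assumptions made for this example, so check the actual result hash before relying on them.

```ruby
softmax_out_name = '140326200803680'

# Mock results; the :data key and the one-array-per-image nesting are assumptions
inference_results = [
  { name: '140326200777584', data: [[0.5, 1.5, -0.2, 0.0]] },
  { name: softmax_out_name, data: [[0.1, 0.6, 0.05, 0.25]] }
]

softmax_out = inference_results.find { |x| x[:name] == softmax_out_name }

# Hypothetical helper: [index, score] pairs of the k largest scores, best first.
def top_k(scores, k)
  scores.each_with_index
        .sort_by { |score, _index| -score }
        .first(k)
        .map { |score, index| [index, score] }
end

top2 = top_k(softmax_out[:data].first, 2)
top2.each { |index, score| puts "class #{index}: #{score}" }
```

The class indices can then be mapped to human-readable labels with whatever category list accompanies the model.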
That's it.
The full code is available at VGG16 example.