diff --git a/quickstart/IntroNotebooks/2. Using the Tensorflow TensorRT Integration.ipynb b/quickstart/IntroNotebooks/2. Using the Tensorflow TensorRT Integration.ipynb deleted file mode 100644 index 7eda93b6..00000000 --- a/quickstart/IntroNotebooks/2. Using the Tensorflow TensorRT Integration.ipynb +++ /dev/null @@ -1,664 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using TF-TRT With Tensorflow 2:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Tensorflow/TensorRT integration (TF-TRT) is a high level Python interface for TensorRT that works directly with Tensorflow models. In Tensorflow 2, TF-TRT allows you to convert Tensorflow SavedModels to TensorRT optimized models and run them within Python. This is a simple and flexible way to get started with TensorRT when using Tensorflow.\n", - "\n", - "This notebook provides a basic introduction and wrapper that makes it easy to work with basic Keras/TF2 models. We will take a pretrained Resnet-50 model from the keras.applications model zoo, convert it using TF-TRT, and run it in the TF-TRT Python runtime!" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Use this when:\n", - "- You want the API with the least dependencies\n", - "- You are willing to give up some optimizations in exchange for more flexibility\n", - "- You have a network which contains operations unsupported by the ONNX parser but still want to use an automatic parser\n", - "- You do not want to write custom C++ plugins/optimizations if your network has unsupported operations\n", - "- You are okay with being limited to the Tensorflow or TRITON runtimes in most cases" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For the TF-TRT portion of this guide, we will be using a wrapper included with the notebooks in the [TensorRT OSS examples](https://github.com/NVIDIA/TensorRT).\n", - "\n", - "You can clone the entire repository and work inside it, or you can grab just the wrapper by:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "--2021-01-29 23:37:25-- https://raw.githubusercontent.com/NVIDIA/TensorRT/main/quickstart/IntroNotebooks/helper.py\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.40.133\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.40.133|:443... connected.\n", - "HTTP request sent, awaiting response... 404 Not Found\n", - "2021-01-29 23:37:25 ERROR 404: Not Found.\n", - "\n" - ] - } - ], - "source": [ - "!wget \"https://raw.githubusercontent.com/NVIDIA/TensorRT/main/quickstart/IntroNotebooks/helper.py\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Checking your GPU status:__\n", - "\n", - "Lets see what GPU hardware we are working with. Our hardware can matter a lot because different cards have different performance profiles and precisions they tend to operate best in. 
For example, a V100 is relatively strong as FP16 processing vs a T4, which tends to operate best in the INT8 mode." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Fri Jan 29 23:37:26 2021 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.1 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. |\n", - "|===============================+======================+======================|\n", - "| 0 Tesla V100-DGXS... On | 00000000:07:00.0 Off | 0 |\n", - "| N/A 42C P0 37W / 300W | 125MiB / 16155MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 Tesla V100-DGXS... On | 00000000:08:00.0 Off | 0 |\n", - "| N/A 42C P0 38W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 2 Tesla V100-DGXS... On | 00000000:0E:00.0 Off | 0 |\n", - "| N/A 41C P0 38W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 3 Tesla V100-DGXS... 
On | 00000000:0F:00.0 Off | 0 |\n", - "| N/A 42C P0 37W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "+-----------------------------------------------------------------------------+\n" - ] - } - ], - "source": [ - "!nvidia-smi" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Basic usage: Optimizing a TF2/Keras model with TensorRT in FP32:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Remember to sucessfully deploy a TensorRT model, you have to answer __five important questions__:\n", - "\n", - "1. __What format should I save my model in?__\n", - "2. __What batch size(s) am I running inference at?__\n", - "3. __What precision am I running inference at?__\n", - "4. __What TensorRT path am I using to convert my model?__\n", - "5. __What runtime am I targeting?__\n", - "\n", - "We will be following this path to convert and deploy our model:\n", - "\n", - "![TF-TRT](./images/tf_trt.png)\n", - "\n", - "Lets address these five questions here!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. What format should I save my model in?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For TF-TRT, we need our models to be in [SavedModel format](https://www.tensorflow.org/guide/saved_model). 
We can load up, for example, a Keras model and save it appropriately as follows:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "!mkdir -p tmp_savedmodels" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "from tensorflow.keras.applications import ResNet50\n", - "\n", - "model_dir = 'tmp_savedmodels/resnet50_saved_model'\n", - "model = ResNet50(include_top=True, weights='imagenet')" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "This property should not be used in TensorFlow 2.0, as updates are applied automatically.\n", - "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "This property should not be used in TensorFlow 2.0, as updates are applied automatically.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/resnet50_saved_model/assets\n" - ] - } - ], - "source": [ - "model.save(model_dir) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. What batch size(s) am I running inference at?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here we generate a dummy batch of data to pass into the network just to get an understanding of its performance. This is normally where you would supply a numpy batch of images." 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "BATCH_SIZE = 32\n", - "\n", - "dummy_input_batch = np.zeros((BATCH_SIZE, 224, 224, 3))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. What precision am I running inference at?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will start with FP32 precision as a baseline! Later in this notebook, we will go through and look at how we can reduce our precision from the default." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "PRECISION = \"FP32\" # Options are \"FP32\", \"FP16\", or \"INT8\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. What TensorRT path am I using to convert my model?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will be using a simplified wrapper (ModelOptimizer) around TF-TRT to handle our conversions for this notebook. The wrapper is bare bones, meant as a springboard for further develoment - not a finished product. It can help us easily and quickly convert a TF-TRT model to a number of precisions." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "from helper import ModelOptimizer # using the helper from \n", - "\n", - "model_dir = 'tmp_savedmodels/resnet50_saved_model'\n", - "\n", - "opt_model = ModelOptimizer(model_dir)" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. 
This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/resnet50_saved_model_FP32/assets\n" - ] - } - ], - "source": [ - "model_fp32 = opt_model.convert(model_dir+'_FP32', precision=PRECISION)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. What TensorRT runtime am I targeting?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "TF-TRT essentially yields a Tensorflow graph with some optimized TensorRT operations included in it. We can run this graph with .predict() like we would any other Tensorflow model." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " ...,\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04]], dtype=float32)" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model_fp32.predict(dummy_input_batch)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We now have a finished TF-TRT optimized Tensorflow graph!" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__We can now compare the TensorRT optimized model with the original:__" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " ...,\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04]], dtype=float32)" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Warm up - the first batch through a model generally takes longer\n", - "model.predict(dummy_input_batch)\n", - "model_fp32.predict(dummy_input_batch)" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "53.5 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "model.predict_on_batch(dummy_input_batch)" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "29.5 ms ± 117 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "model_fp32.predict(dummy_input_batch)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Reducing Precision:\n", - "\n", - "Inference typically requires less numeric precision than training. With some care, lower precision can give you faster computation and lower memory consumption without sacrificing any meaningful accuracy. TensorRT supports TF32, FP32, FP16, and INT8 precisions." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Reducing precision to FP16:__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "FP16 \"mixed precision\" inference gives up some accuracy in exchange for faster models with lower latency and lower memory footprint. In practice, the accuracy loss is generally negligible in FP16 - so FP16 is a fairly safe bet in most cases for inference. Cards that are focused on deep learning training often have strong FP16 capabilities, making FP16 a great choice for GPUs that are expected to be used for both training and inference - such as the NVIDIA V100\n", - "\n", - "Let's convert our model to FP16 and see how it performs:" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_1_0 in TF-TRT cache. 
This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/resnet50_saved_model_FP16/assets\n" - ] - }, - { - "data": { - "text/plain": [ - "array([[1.7182514e-04, 3.3864001e-04, 6.3493084e-05, ..., 1.5010530e-05,\n", - " 1.4759685e-04, 6.7664997e-04],\n", - " [1.7182514e-04, 3.3864001e-04, 6.3493084e-05, ..., 1.5010530e-05,\n", - " 1.4759685e-04, 6.7664997e-04],\n", - " [1.7182514e-04, 3.3864001e-04, 6.3493084e-05, ..., 1.5010530e-05,\n", - " 1.4759685e-04, 6.7664997e-04],\n", - " ...,\n", - " [1.7182514e-04, 3.3864001e-04, 6.3493084e-05, ..., 1.5010530e-05,\n", - " 1.4759685e-04, 6.7664997e-04],\n", - " [1.7182514e-04, 3.3864001e-04, 6.3493084e-05, ..., 1.5010530e-05,\n", - " 1.4759685e-04, 6.7664997e-04],\n", - " [1.7182514e-04, 3.3864001e-04, 6.3493084e-05, ..., 1.5010530e-05,\n", - " 1.4759685e-04, 6.7664997e-04]], dtype=float32)" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model_fp16 = opt_model.convert(model_dir+'_FP16', precision=\"FP16\")\n", - "\n", - "model_fp16.predict(dummy_input_batch)" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "13.5 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "model_fp16.predict(dummy_input_batch)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Reducing precision to INT8:__\n", - "\n", - "Whether you want to further reduce to INT8 precision depends on hardware - Turing cards and later INT8 is often better. Inference focused cards such as the NVIDIA T4 or systems-on-module such as Jetson AGX Xavier do well with INT8. In contrast, on a training-focused GPU like V100, INT8 often isn't any faster than FP16." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To perform INT8 inference, we need to see what the normal range of activations are in the network so we can quantize our INT8 representations based on a normal set of values for our dataset. It is important that this dataset is representative of the testing samples in order to maintain accuracy levels.\n", - "\n", - "Here, we just want to see how our network performs in TensorRT from a runtime standpoint - so we will just feed dummy data and dummy calibration data into TensorRT." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "dummy_calibration_batch = np.zeros((8, 224, 224, 3))\n", - "\n", - "opt_model.set_calibration_data(dummy_calibration_batch)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Then, we convert our model to INT8 as before:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. 
This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/resnet50_saved_model_INT8/assets\n" - ] - }, - { - "data": { - "text/plain": [ - "array([[1.61497956e-04, 3.58211488e-04, 7.12977999e-05, ...,\n", - " 1.43723055e-05, 1.47045619e-04, 7.21490127e-04],\n", - " [1.61497956e-04, 3.58211488e-04, 7.12977999e-05, ...,\n", - " 1.43723055e-05, 1.47045619e-04, 7.21490127e-04],\n", - " [1.61497956e-04, 3.58211488e-04, 7.12977999e-05, ...,\n", - " 1.43723055e-05, 1.47045619e-04, 7.21490127e-04],\n", - " ...,\n", - " [1.61497956e-04, 3.58211488e-04, 7.12977999e-05, ...,\n", - " 1.43723055e-05, 1.47045619e-04, 7.21490127e-04],\n", - " [1.61497956e-04, 3.58211488e-04, 7.12977999e-05, ...,\n", - " 1.43723055e-05, 1.47045619e-04, 7.21490127e-04],\n", - " [1.61497956e-04, 3.58211488e-04, 7.12977999e-05, ...,\n", - " 1.43723055e-05, 1.47045619e-04, 7.21490127e-04]], dtype=float32)" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model_int8 = opt_model.convert(model_dir+'_INT8', precision=\"INT8\")\n", - "\n", - "model_int8.predict(dummy_input_batch)" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "13.1 ms ± 29.5 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "model_int8.predict(dummy_input_batch)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Next Steps:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can find other Jupyter Notebooks demonstrating TF-TRT conversions and end to end workflows for many other Keras applications and models, including detection models and segmentation models, in other example TF-TRT notebooks!\n", - "\n", - "Here are links to those notebooks:\n", - "\n", - "[__Classification Examples__](./Additional%20Examples/1.%20TF-TRT%20Classification.ipynb)\n", - "\n", - "[__Detection Example__](./Additional%20Examples/2.%20TF-TRT%20Detection.ipynb)\n", - "\n", - "[__Segmentation Example__](./Additional%20Examples/3.%20TF-TRT%20Segmentation.ipynb)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/quickstart/IntroNotebooks/3. Using Tensorflow 2 through ONNX.ipynb b/quickstart/IntroNotebooks/3. Using Tensorflow 2 through ONNX.ipynb deleted file mode 100644 index aa8f6328..00000000 --- a/quickstart/IntroNotebooks/3. 
Using Tensorflow 2 through ONNX.ipynb +++ /dev/null @@ -1,1275 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using Tensorflow through ONNX:\n", - "\n", - "The ONNX path to getting a TensorRT engine is a high-performance approach to TensorRT conversion that works with a variety of frameworks - including Tensorflow and Tensorflow 2.\n", - "\n", - "TensorRT's ONNX parser is an all-or-nothing parser for ONNX models that ensures an optimal, single TensorRT engine and is great for exporting to the TensorRT API runtimes. ONNX models can be easily generated from Tensorflow models using the ONNX project's tf2onnx tool.\n", - "\n", - "In this notebook we will take a look at how ONNX models can be generated from a Keras/TF2 ResNet50 model, how we can convert those ONNX models to TensorRT engines using trtexec, and finally how we can use the native Python TensorRT runtime to feed a batch of data into the TRT engine at inference time.\n", - "\n", - "Essentially, we will follow this path to convert and deploy our model:\n", - "\n", - "![Tensorflow+ONNX](./images/tf_onnx.png)\n", - "\n", - "__Use this when:__\n", - "- You want the most efficient runtime performance possible out of an automatic parser\n", - "- You have a network consisting of mostly supported operations - including operations and layers that the ONNX parser uniquely supports (Such as RNNs/LSTMs/GRUs)\n", - "- You are willing to write custom C++ plugins for any unsupported operations (if your network has any)\n", - "- You do not want to use the manual layer builder API" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Checking your GPU status:__\n", - "\n", - "Lets see what GPU hardware we are working with. Our hardware can matter a lot because different cards have different performance profiles and precisions they tend to operate best in. 
For example, a V100 is relatively strong as FP16 processing vs a T4, which tends to operate best in the INT8 mode." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 377 - }, - "id": "IJBfZsGo8yaV", - "outputId": "f4c4e20d-fcfd-43a2-b10d-c6978c25c91f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Wed Jun 9 19:47:48 2021 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.3 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. |\n", - "|===============================+======================+======================|\n", - "| 0 Tesla V100-DGXS... On | 00000000:07:00.0 Off | 0 |\n", - "| N/A 45C P0 63W / 300W | 5572MiB / 16155MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 Tesla V100-DGXS... On | 00000000:08:00.0 Off | 0 |\n", - "| N/A 44C P0 41W / 300W | 9MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 2 Tesla V100-DGXS... On | 00000000:0E:00.0 Off | 0 |\n", - "| N/A 43C P0 41W / 300W | 9MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 3 Tesla V100-DGXS... 
On | 00000000:0F:00.0 Off | 0 |\n", - "| N/A 44C P0 39W / 300W | 9MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "+-----------------------------------------------------------------------------+\n" - ] - } - ], - "source": [ - "!nvidia-smi" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Remember to sucessfully deploy a TensorRT model, you have to make __five key decisions__:\n", - "\n", - "1. __What format should I save my model in?__\n", - "2. __What batch size(s) am I running inference at?__\n", - "3. __What precision am I running inference at?__\n", - "4. __What TensorRT path am I using to convert my model?__\n", - "5. __What runtime am I targeting?__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. What format should I save my model in?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our first step is to load up a pretrained ResNet50 model. 
This can be done easily using keras.applications - a collection of pretrained image model classifiers that can additionally be used as backbones for detection and other deep learning problems.\n", - "\n", - "We can load up a pretrained classifier with batch size 32 as follows:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "iVRVItvR8quS" - }, - "outputs": [], - "source": [ - "from tensorflow.keras.applications import ResNet50\n", - "\n", - "BATCH_SIZE = 32" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "cKT07xPV8qua" - }, - "outputs": [], - "source": [ - "model = ResNet50(weights='imagenet')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For the purposes of checking our non-optimized model, we can use a dummy batch of data to verify our performance and the consistency of our results across precisions. 224x224 RGB images are a common format, so lets generate a batch of them.\n", - "\n", - "Once we generate a batch of them, we will feed it through the model using .predict() to \"warm up\" the model. The first batch you feed through a deep learning model often takes a lot longer as just-in-time compilation and other runtime optimizations are performed. 
Once you get that first batch through, further performance tends to be more consistent.\n", - "\n", - "To create a test batch, we will simply repeat one open-source dog image from http://www.dog.ceo" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(32, 224, 224, 3)" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import numpy as np\n", - "from skimage import io\n", - "from skimage.transform import resize\n", - "from matplotlib import pyplot as plt\n", - "\n", - "url='https://images.dog.ceo/breeds/retriever-golden/n02099601_3004.jpg'\n", - "img = resize(io.imread(url), (224, 224))\n", - "input_batch = 255*np.array(np.repeat(np.expand_dims(np.array(img, dtype=np.float32), axis=0), BATCH_SIZE, axis=0), dtype=np.float32)\n", - "\n", - "input_batch.shape" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plt.imshow(input_batch[0]/255)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The image above is a Golden Retriever, class 207 in ImageNet. So we look for class 207 in the top 5 predictions to verify our model works as intended:" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Class | Probability (out of 1)\n" - ] - }, - { - "data": { - "text/plain": [ - "[(160, 0.32290387),\n", - " (169, 0.266499),\n", - " (212, 0.16812354),\n", - " (170, 0.07066823),\n", - " (207, 0.03341851)]" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "predictions = model.predict(input_batch) # warm up\n", - "indices = (-predictions[0]).argsort()[:5]\n", - "print(\"Class | Probability (out of 1)\")\n", - "list(zip(indices, predictions[0][indices]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Labels 150 to 275 or so are dogs in ImageNet, so look for those as other common predictions in addition to our correct 207 class." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Baseline Timing:__\n", - "\n", - "Once we have warmed up our non-optimized model, we can get a rough timing estimate of our model using %%timeit, which runs the cell several times and reports timing information.\n", - "\n", - "Lets take a look at how long our model takes to run at baseline before doing any TensorRT optimization:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 85 - }, - "id": "eMu3dZlM96bh", - "outputId": "537a88e2-ad7d-413a-f815-abd91f010e21" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "46.8 ms ± 514 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "result = model.predict_on_batch(input_batch) # Check default performance" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Okay - now that we have a baseline model, lets convert it to the format TensorRT understands best: ONNX. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Convert Keras model to ONNX intermediate model and save:__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The ONNX format is a framework-agnostic way of describing and saving the structure and state of deep learning models. We can convert Tensorflow 2 Keras models to ONNX using the tf2onnx tool provided by the ONNX project. 
(You can find the ONNX project here: https://onnx.ai or on GitHub here: https://github.com/onnx/onnx)" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "aG3tXUEx8quf" - }, - "outputs": [], - "source": [ - "import onnx" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Converting a model with default parameters to an ONNX model is fairly straightforward:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 68 - }, - "id": "QxLAvWp68quk", - "outputId": "d750962a-d098-4a63-c195-c3442211cdc1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: my_model/assets\n", - "2021-06-09 19:48:30.462380: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n", - "/usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour\n", - " warn(RuntimeWarning(msg))\n", - "2021-06-09 19:48:31.938818: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set\n", - "2021-06-09 19:48:31.939684: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1\n", - "2021-06-09 19:48:32.010614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties: \n", - "pciBusID: 0000:07:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:32.011850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 1 with properties: \n", - "pciBusID: 0000:08:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - 
"coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:32.013128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 2 with properties: \n", - "pciBusID: 0000:0e:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:32.014344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 3 with properties: \n", - "pciBusID: 0000:0f:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:32.014373: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n", - "2021-06-09 19:48:32.019097: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11\n", - "2021-06-09 19:48:32.019146: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11\n", - "2021-06-09 19:48:32.020281: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n", - "2021-06-09 19:48:32.020567: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n", - "2021-06-09 19:48:32.021254: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11\n", - "2021-06-09 19:48:32.022280: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n", - "2021-06-09 19:48:32.022445: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n", - "2021-06-09 19:48:32.030879: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1889] Adding visible gpu devices: 0, 1, 2, 3\n", - "2021-06-09 19:48:32.032680: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set\n", - "2021-06-09 19:48:33.010741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties: \n", - "pciBusID: 0000:07:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:33.011970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 1 with properties: \n", - "pciBusID: 0000:08:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:33.013195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 2 with properties: \n", - "pciBusID: 0000:0e:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:33.014389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 3 with properties: \n", - "pciBusID: 0000:0f:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:33.014428: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n", - "2021-06-09 19:48:33.014458: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11\n", - "2021-06-09 19:48:33.014478: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11\n", - "2021-06-09 19:48:33.014497: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n", - "2021-06-09 19:48:33.014516: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n", - "2021-06-09 19:48:33.014534: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11\n", - "2021-06-09 19:48:33.014552: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n", - "2021-06-09 19:48:33.014571: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n", - "2021-06-09 19:48:33.022970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1889] Adding visible gpu devices: 0, 1, 2, 3\n", - "2021-06-09 19:48:33.023016: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n", - "2021-06-09 19:48:35.609734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1287] Device interconnect StreamExecutor with strength 1 edge matrix:\n", - "2021-06-09 19:48:35.609783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1293] 0 1 2 3 \n", - "2021-06-09 19:48:35.609797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 0: N Y Y Y \n", - "2021-06-09 19:48:35.609806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 1: Y N Y Y \n", - "2021-06-09 19:48:35.609816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 2: Y Y N Y \n", - "2021-06-09 19:48:35.609825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 3: Y Y Y N \n", - "2021-06-09 19:48:35.619000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 203 MB memory) -> physical GPU (device: 0, name: Tesla V100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)\n", - "2021-06-09 
19:48:35.620513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14206 MB memory) -> physical GPU (device: 1, name: Tesla V100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:35.621962: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14206 MB memory) -> physical GPU (device: 2, name: Tesla V100-DGXS-16GB, pci bus id: 0000:0e:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:35.623398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14206 MB memory) -> physical GPU (device: 3, name: Tesla V100-DGXS-16GB, pci bus id: 0000:0f:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:35,625 - WARNING - '--tag' not specified for saved_model. Using --tag serve\n", - "2021-06-09 19:48:43,221 - INFO - Signatures found in model: [serving_default].\n", - "2021-06-09 19:48:43,221 - WARNING - '--signature_def' not specified, using first signature: serving_default\n", - "2021-06-09 19:48:43,222 - INFO - Output names: ['predictions']\n", - "2021-06-09 19:48:43.250962: I tensorflow/core/grappler/devices.cc:69] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 4\n", - "2021-06-09 19:48:43.251124: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session\n", - "2021-06-09 19:48:43.251388: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set\n", - "2021-06-09 19:48:43.252059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties: \n", - "pciBusID: 0000:07:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:43.253259: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 1 with properties: \n", - "pciBusID: 0000:08:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:43.254444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 2 with properties: \n", - "pciBusID: 0000:0e:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:43.255627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 3 with properties: \n", - "pciBusID: 0000:0f:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:43.255663: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n", - "2021-06-09 19:48:43.255693: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11\n", - "2021-06-09 19:48:43.255712: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11\n", - "2021-06-09 19:48:43.255730: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n", - "2021-06-09 19:48:43.255748: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n", - "2021-06-09 19:48:43.255765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11\n", - "2021-06-09 19:48:43.255783: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n", - "2021-06-09 19:48:43.255801: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n", - "2021-06-09 19:48:43.264001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1889] Adding visible gpu devices: 0, 1, 2, 3\n", - "2021-06-09 19:48:43.264071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1287] Device interconnect StreamExecutor with strength 1 edge matrix:\n", - "2021-06-09 19:48:43.264086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1293] 0 1 2 3 \n", - "2021-06-09 19:48:43.264097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 0: N Y Y Y \n", - "2021-06-09 19:48:43.264106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 1: Y N Y Y \n", - "2021-06-09 19:48:43.264116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 2: Y Y N Y \n", - "2021-06-09 19:48:43.264125: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 3: Y Y Y N \n", - "2021-06-09 19:48:43.269085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 203 MB memory) -> physical GPU (device: 0, name: Tesla V100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:43.270297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14206 MB memory) -> physical GPU (device: 1, name: Tesla V100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:43.271732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14206 MB memory) -> physical GPU (device: 2, name: Tesla V100-DGXS-16GB, pci bus id: 0000:0e:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:43.273448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14206 MB memory) -> physical GPU 
(device: 3, name: Tesla V100-DGXS-16GB, pci bus id: 0000:0f:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:43.293134: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198860000 Hz\n", - "2021-06-09 19:48:43.355209: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] Optimization results for grappler item: graph_to_optimize\n", - " function_optimizer: Graph size after: 1253 nodes (930), 1908 edges (1585), time = 33.193ms.\n", - " function_optimizer: function_optimizer did nothing. time = 0.577ms.\n", - "\n", - "2021-06-09 19:48:46.008484: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set\n", - "2021-06-09 19:48:46.031017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties: \n", - "pciBusID: 0000:07:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:46.033674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 1 with properties: \n", - "pciBusID: 0000:08:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:46.035311: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 2 with properties: \n", - "pciBusID: 0000:0e:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:46.036940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 3 with properties: \n", - "pciBusID: 0000:0f:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:46.036986: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n", - "2021-06-09 19:48:46.037035: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11\n", - "2021-06-09 19:48:46.037062: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11\n", - "2021-06-09 19:48:46.037086: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n", - "2021-06-09 19:48:46.037110: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n", - "2021-06-09 19:48:46.037133: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11\n", - "2021-06-09 19:48:46.037157: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n", - "2021-06-09 19:48:46.037181: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n", - "2021-06-09 19:48:46.046998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1889] Adding visible gpu devices: 0, 1, 2, 3\n", - "2021-06-09 19:48:46.047077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1287] Device interconnect StreamExecutor with strength 1 edge matrix:\n", - "2021-06-09 19:48:46.047095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1293] 0 1 2 3 \n", - "2021-06-09 19:48:46.047108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 0: N Y Y Y \n", - "2021-06-09 19:48:46.047120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 1: Y N Y Y \n", - "2021-06-09 19:48:46.047131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 2: Y Y N Y \n", - "2021-06-09 19:48:46.047142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 3: Y Y Y N \n", - "2021-06-09 
19:48:46.052418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 203 MB memory) -> physical GPU (device: 0, name: Tesla V100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:46.053664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14206 MB memory) -> physical GPU (device: 1, name: Tesla V100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:46.054881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14206 MB memory) -> physical GPU (device: 2, name: Tesla V100-DGXS-16GB, pci bus id: 0000:0e:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:46.056098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14206 MB memory) -> physical GPU (device: 3, name: Tesla V100-DGXS-16GB, pci bus id: 0000:0f:00.0, compute capability: 7.0)\n", - "WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tf2onnx/tf_loader.py:603: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "Use `tf.compat.v1.graph_util.extract_sub_graph`\n", - "2021-06-09 19:48:46,541 - WARNING - From /usr/local/lib/python3.8/dist-packages/tf2onnx/tf_loader.py:603: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "Use `tf.compat.v1.graph_util.extract_sub_graph`\n", - "2021-06-09 19:48:46.600644: I tensorflow/core/grappler/devices.cc:69] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 4\n", - "2021-06-09 19:48:46.600797: I 
tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session\n", - "2021-06-09 19:48:46.601148: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set\n", - "2021-06-09 19:48:46.602435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties: \n", - "pciBusID: 0000:07:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:46.604322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 1 with properties: \n", - "pciBusID: 0000:08:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:46.606193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 2 with properties: \n", - "pciBusID: 0000:0e:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:46.608049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 3 with properties: \n", - "pciBusID: 0000:0f:00.0 name: Tesla V100-DGXS-16GB computeCapability: 7.0\n", - "coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s\n", - "2021-06-09 19:48:46.608091: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n", - "2021-06-09 19:48:46.608129: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11\n", - "2021-06-09 19:48:46.608153: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11\n", - "2021-06-09 19:48:46.608176: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n", - "2021-06-09 19:48:46.608198: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n", - "2021-06-09 19:48:46.608220: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11\n", - "2021-06-09 19:48:46.608242: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n", - "2021-06-09 19:48:46.608265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n", - "2021-06-09 19:48:46.625482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1889] Adding visible gpu devices: 0, 1, 2, 3\n", - "2021-06-09 19:48:46.625560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1287] Device interconnect StreamExecutor with strength 1 edge matrix:\n", - "2021-06-09 19:48:46.625578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1293] 0 1 2 3 \n", - "2021-06-09 19:48:46.625590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 0: N Y Y Y \n", - "2021-06-09 19:48:46.625601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 1: Y N Y Y \n", - "2021-06-09 19:48:46.625612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 2: Y Y N Y \n", - "2021-06-09 19:48:46.625623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 3: Y Y Y N \n", - "2021-06-09 19:48:46.634557: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 203 MB memory) -> physical GPU (device: 0, name: Tesla V100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:46.636578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14206 MB memory) 
-> physical GPU (device: 1, name: Tesla V100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:46.638422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14206 MB memory) -> physical GPU (device: 2, name: Tesla V100-DGXS-16GB, pci bus id: 0000:0e:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:46.640290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14206 MB memory) -> physical GPU (device: 3, name: Tesla V100-DGXS-16GB, pci bus id: 0000:0f:00.0, compute capability: 7.0)\n", - "2021-06-09 19:48:47.379855: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] Optimization results for grappler item: graph_to_optimize\n", - " constant_folding: Graph size after: 560 nodes (-640), 1215 edges (-640), time = 399.986ms.\n", - " function_optimizer: function_optimizer did nothing. time = 1.17ms.\n", - " constant_folding: Graph size after: 560 nodes (0), 1215 edges (0), time = 101.728ms.\n", - " function_optimizer: function_optimizer did nothing. 
time = 1.017ms.\n", - "\n", - "2021-06-09 19:48:47,938 - INFO - Using tensorflow=2.4.0, onnx=1.9.0, tf2onnx=1.8.5/50049d\n", - "2021-06-09 19:48:47,939 - INFO - Using opset \n", - "2021-06-09 19:48:52,720 - INFO - Computed 0 values for constant folding\n", - "2021-06-09 19:49:05,218 - INFO - Optimizing ONNX model\n", - "2021-06-09 19:49:06,920 - INFO - After optimization: Add -1 (18->17), BatchNormalization -53 (53->0), Const -162 (270->108), GlobalAveragePool +1 (0->1), Identity -57 (57->0), ReduceMean -1 (1->0), Squeeze +1 (0->1), Transpose -213 (214->1)\n", - "2021-06-09 19:49:07,076 - INFO - \n", - "2021-06-09 19:49:07,076 - INFO - Successfully converted TensorFlow model my_model to ONNX\n", - "2021-06-09 19:49:07,076 - INFO - Model inputs: ['input_1:0']\n", - "2021-06-09 19:49:07,076 - INFO - Model outputs: ['predictions']\n", - "2021-06-09 19:49:07,076 - INFO - ONNX model is saved at temp.onnx\n" - ] - } - ], - "source": [ - "model.save('my_model')\n", - "!python -m tf2onnx.convert --saved-model my_model --output temp.onnx\n", - "onnx_model = onnx.load_model('temp.onnx')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "That said, we do need to make one change for our model to work with TensorRT. Keras by default uses a dynamic input shape in its networks - where it can handle arbitrary batch sizes at every update. While TensorRT can do this, it requires extra configuration. \n", - "\n", - "Instead, we will just set the input size to be fixed to our batch size. This will work with TensorRT out of the box!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Configure ONNX File Batch Size:__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Note:__ We need to do two things to set our batch size with ONNX. The first is to modify our ONNX file to change its default batch size to our target batch size. 
The second is setting our converter to use the __explicit batch__ mode, which will use this default batch size as our final batch size." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "inputs = onnx_model.graph.input\n", - "for input in inputs:\n", - " dim1 = input.type.tensor_type.shape.dim[0]\n", - " dim1.dim_value = BATCH_SIZE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Save Model:__" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "jFT6-13f8qup" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Done saving!\n" - ] - } - ], - "source": [ - "model_name = \"resnet50_onnx_model.onnx\"\n", - "onnx.save_model(onnx_model, model_name)\n", - "print(\"Done saving!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once we get our model into ONNX format, we can convert it efficiently using TensorRT. For this, TensorRT needs exclusive access to your GPU. If you so much as import Tensorflow, it will generally consume all of your GPU memory. To get around this, before moving on go ahead and shut down this notebook and restart it. (You can do this in the menu: Kernel -> Restart Kernel)\n", - "\n", - "Make sure not to import Tensorflow at any point after restarting the runtime! 
\n", - "\n", - "(The following cell is a quick shortcut to make your notebook restart:)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "uZUnHVHE8quu" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Restarting kernel in three seconds...\n" - ] - } - ], - "source": [ - "import os, time\n", - "print(\"Restarting kernel in three seconds...\")\n", - "time.sleep(3)\n", - "print(\"Restarting kernel now\")\n", - "os._exit(0) # Exit this kernel process so TRT doesn't fight with Tensorflow for GPU memory - TF claims nearly all GPU memory by default" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. What batch size(s) am I running inference at?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We have actually already set our inference batch size - see the note above in section 1!\n", - "\n", - "We are going to set our target batch size to a fixed size of 32." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "BATCH_SIZE = 32" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We need to do two things to set our batch size to a fixed batch size with ONNX: \n", - "\n", - "1. Modify our ONNX file to change its default batch size to our target batch size, which we did above.\n", - "2. Use the trtexec --explicitBatch flag, which we do in section 4 when we run trtexec." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. What precision am I running inference at?\n", - "\n", - "Next, we choose the precision for the TensorRT engine that we build in section 4. We will later load that engine into the native Python TensorRT runtime. This runtime strikes a balance between the ease of use of the high level Python runtimes and the low level C++ runtimes.\n", - "\n", - "First, as before, let's create a dummy batch. 
Importantly, by default TensorRT will use the input precision you give it as the default precision for the rest of the network. \n", - "\n", - "Remember that lower precisions than FP32 tend to run faster. There are two common reduced precision modes - FP16 and INT8. Graphics cards that are designed to do inference well often have an affinity for one of these two types. This guide was developed on an NVIDIA V100, which favors FP16, so we will use that here by default. INT8 is a more complicated process that requires a calibration step." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "USE_FP16 = True\n", - "\n", - "target_dtype = np.float16 if USE_FP16 else np.float32" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We generate a batch of repeating Golden Retriever images, as before. Make sure that for TensorRT the image is resized to the size your model expects. Tensorflow and TensorRT have different behavior for handling 'oversized' images - so this is a safe way of ensuring consistent results across the two." 
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "from skimage import io\n", - "from skimage.transform import resize\n", - "from matplotlib import pyplot as plt\n", - "\n", - "url='https://images.dog.ceo/breeds/retriever-golden/n02099601_3004.jpg'\n", - "img = resize(io.imread(url), (224, 224))\n", - "input_batch = 255*np.array(np.repeat(np.expand_dims(np.array(img, dtype=np.float32), axis=0), BATCH_SIZE, axis=0), dtype=np.float32)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we must cast the input batch to the target precision (FP32 or FP16):" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "input_batch = input_batch.astype(target_dtype)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. What TensorRT path am I using to convert my model?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "TensorRT is able to take ONNX models and convert them entirely into a single, efficient TensorRT engine. Restart your Jupyter kernel, and then start here!\n", - "\n", - "We can use trtexec, a command line tool for working with TensorRT, in order to convert an ONNX model to an engine file.\n", - "\n", - "To convert the model we saved in the previous steps, we need to point to the ONNX file, give trtexec a name to save the engine as, and finally specify that we want to use a fixed batch size instead of a dynamic one.\n", - "\n", - "__Remember to shut down all Jupyter notebooks and restart your Jupyter kernel after \"1. 
What format should I save my model in?\" - otherwise this cell will crash as TensorRT competes with Tensorflow for GPU memory:__" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 34 - }, - "id": "h60Gmotx8quz", - "outputId": "065384aa-c848-4194-c72c-cad0d80449ca" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "&&&& RUNNING TensorRT.trtexec # trtexec --onnx=resnet50_onnx_model.onnx --saveEngine=resnet_engine.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16\n", - "[06/09/2021-19:49:25] [I] === Model Options ===\n", - "[06/09/2021-19:49:25] [I] Format: ONNX\n", - "[06/09/2021-19:49:25] [I] Model: resnet50_onnx_model.onnx\n", - "[06/09/2021-19:49:25] [I] Output:\n", - "[06/09/2021-19:49:25] [I] === Build Options ===\n", - "[06/09/2021-19:49:25] [I] Max batch: explicit\n", - "[06/09/2021-19:49:25] [I] Workspace: 16 MiB\n", - "[06/09/2021-19:49:25] [I] minTiming: 1\n", - "[06/09/2021-19:49:25] [I] avgTiming: 8\n", - "[06/09/2021-19:49:25] [I] Precision: FP32+FP16\n", - "[06/09/2021-19:49:25] [I] Calibration: \n", - "[06/09/2021-19:49:25] [I] Refit: Disabled\n", - "[06/09/2021-19:49:25] [I] Safe mode: Disabled\n", - "[06/09/2021-19:49:25] [I] Save engine: resnet_engine.trt\n", - "[06/09/2021-19:49:25] [I] Load engine: \n", - "[06/09/2021-19:49:25] [I] Builder Cache: Enabled\n", - "[06/09/2021-19:49:25] [I] NVTX verbosity: 0\n", - "[06/09/2021-19:49:25] [I] Tactic sources: Using default tactic sources\n", - "[06/09/2021-19:49:25] [I] Input(s): fp16:chw\n", - "[06/09/2021-19:49:25] [I] Output(s): fp16:chw\n", - "[06/09/2021-19:49:25] [I] Input build shapes: model\n", - "[06/09/2021-19:49:25] [I] Input calibration shapes: model\n", - "[06/09/2021-19:49:25] [I] === System Options ===\n", - "[06/09/2021-19:49:25] [I] Device: 0\n", - "[06/09/2021-19:49:25] [I] DLACore: \n", - "[06/09/2021-19:49:25] [I] Plugins:\n", - 
"[06/09/2021-19:49:25] [I] === Inference Options ===\n", - "[06/09/2021-19:49:25] [I] Batch: Explicit\n", - "[06/09/2021-19:49:25] [I] Input inference shapes: model\n", - "[06/09/2021-19:49:25] [I] Iterations: 10\n", - "[06/09/2021-19:49:25] [I] Duration: 3s (+ 200ms warm up)\n", - "[06/09/2021-19:49:25] [I] Sleep time: 0ms\n", - "[06/09/2021-19:49:25] [I] Streams: 1\n", - "[06/09/2021-19:49:25] [I] ExposeDMA: Disabled\n", - "[06/09/2021-19:49:25] [I] Data transfers: Enabled\n", - "[06/09/2021-19:49:25] [I] Spin-wait: Disabled\n", - "[06/09/2021-19:49:25] [I] Multithreading: Disabled\n", - "[06/09/2021-19:49:25] [I] CUDA Graph: Disabled\n", - "[06/09/2021-19:49:25] [I] Separate profiling: Disabled\n", - "[06/09/2021-19:49:25] [I] Skip inference: Disabled\n", - "[06/09/2021-19:49:25] [I] Inputs:\n", - "[06/09/2021-19:49:25] [I] === Reporting Options ===\n", - "[06/09/2021-19:49:25] [I] Verbose: Disabled\n", - "[06/09/2021-19:49:25] [I] Averages: 10 inferences\n", - "[06/09/2021-19:49:25] [I] Percentile: 99\n", - "[06/09/2021-19:49:25] [I] Dump refittable layers:Disabled\n", - "[06/09/2021-19:49:25] [I] Dump output: Disabled\n", - "[06/09/2021-19:49:25] [I] Profile: Disabled\n", - "[06/09/2021-19:49:25] [I] Export timing to JSON file: \n", - "[06/09/2021-19:49:25] [I] Export output to JSON file: \n", - "[06/09/2021-19:49:25] [I] Export profile to JSON file: \n", - "[06/09/2021-19:49:25] [I] \n", - "[06/09/2021-19:49:25] [I] === Device Information ===\n", - "[06/09/2021-19:49:25] [I] Selected Device: Tesla V100-DGXS-16GB\n", - "[06/09/2021-19:49:25] [I] Compute Capability: 7.0\n", - "[06/09/2021-19:49:25] [I] SMs: 80\n", - "[06/09/2021-19:49:25] [I] Compute Clock Rate: 1.53 GHz\n", - "[06/09/2021-19:49:25] [I] Device Global Memory: 16155 MiB\n", - "[06/09/2021-19:49:25] [I] Shared Memory per SM: 96 KiB\n", - "[06/09/2021-19:49:25] [I] Memory Bus Width: 4096 bits (ECC enabled)\n", - "[06/09/2021-19:49:25] [I] Memory Clock Rate: 0.877 GHz\n", - "[06/09/2021-19:49:25] 
[I] \n", - "[06/09/2021-19:49:42] [I] [TRT] ----------------------------------------------------------------\n", - "[06/09/2021-19:49:42] [I] [TRT] Input filename: resnet50_onnx_model.onnx\n", - "[06/09/2021-19:49:42] [I] [TRT] ONNX IR version: 0.0.4\n", - "[06/09/2021-19:49:42] [I] [TRT] Opset version: 9\n", - "[06/09/2021-19:49:42] [I] [TRT] Producer name: tf2onnx\n", - "[06/09/2021-19:49:42] [I] [TRT] Producer version: 1.8.5\n", - "[06/09/2021-19:49:42] [I] [TRT] Domain: \n", - "[06/09/2021-19:49:42] [I] [TRT] Model version: 0\n", - "[06/09/2021-19:49:42] [I] [TRT] Doc string: \n", - "[06/09/2021-19:49:42] [I] [TRT] ----------------------------------------------------------------\n", - "[06/09/2021-19:49:48] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.\n", - "[06/09/2021-19:51:05] [I] [TRT] Detected 1 inputs and 1 output network tensors.\n", - "[06/09/2021-19:51:06] [I] Engine built in 100.683 sec.\n", - "[06/09/2021-19:51:06] [I] Starting inference\n", - "[06/09/2021-19:51:09] [I] Warmup completed 0 queries over 200 ms\n", - "[06/09/2021-19:51:09] [I] Timing trace has 0 queries over 2.99006 s\n", - "[06/09/2021-19:51:09] [I] Trace averages of 10 runs:\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48546 ms - Host latency: 6.30948 ms (end to end 10.0032 ms, enqueue 0.539108 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48946 ms - Host latency: 6.31468 ms (end to end 10.9038 ms, enqueue 0.516052 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48004 ms - Host latency: 6.3107 ms (end to end 10.8822 ms, enqueue 0.513507 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.49315 ms - Host latency: 6.34006 ms (end to end 10.4643 ms, enqueue 0.512753 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.52059 ms - Host latency: 6.36953 ms (end to end 
10.2954 ms, enqueue 0.498505 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.50788 ms - Host latency: 6.3551 ms (end to end 9.11696 ms, enqueue 0.518701 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.49774 ms - Host latency: 6.3454 ms (end to end 10.9278 ms, enqueue 0.495056 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.50585 ms - Host latency: 6.35638 ms (end to end 10.9322 ms, enqueue 0.505725 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.50247 ms - Host latency: 6.35249 ms (end to end 10.5564 ms, enqueue 0.513574 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.51249 ms - Host latency: 6.36059 ms (end to end 9.63242 ms, enqueue 0.498096 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.4911 ms - Host latency: 6.33875 ms (end to end 8.90275 ms, enqueue 0.474237 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.50072 ms - Host latency: 6.34651 ms (end to end 10.4826 ms, enqueue 0.498499 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.49602 ms - Host latency: 6.34083 ms (end to end 10.92 ms, enqueue 0.486401 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.49089 ms - Host latency: 6.3358 ms (end to end 10.8925 ms, enqueue 0.490247 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48907 ms - Host latency: 6.33452 ms (end to end 10.1912 ms, enqueue 0.482959 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.47534 ms - Host latency: 6.31992 ms (end to end 8.9359 ms, enqueue 0.484119 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.47952 ms - Host latency: 6.32281 ms (end to end 10.4421 ms, enqueue 0.481885 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48701 ms - Host latency: 6.33408 ms (end to end 10.9013 ms, enqueue 0.491455 ms)\n", - 
"[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48179 ms - Host latency: 6.33092 ms (end to end 10.885 ms, enqueue 0.505078 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48776 ms - Host latency: 6.33756 ms (end to end 10.3106 ms, enqueue 0.494629 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.47145 ms - Host latency: 6.31754 ms (end to end 9.37426 ms, enqueue 0.481995 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48057 ms - Host latency: 6.32472 ms (end to end 9.55609 ms, enqueue 0.480151 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48557 ms - Host latency: 6.33252 ms (end to end 10.4543 ms, enqueue 0.486841 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.50972 ms - Host latency: 6.35627 ms (end to end 10.9478 ms, enqueue 0.488062 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.50054 ms - Host latency: 6.34517 ms (end to end 10.0418 ms, enqueue 0.483325 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48201 ms - Host latency: 6.32832 ms (end to end 9.67512 ms, enqueue 0.481812 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48279 ms - Host latency: 6.32742 ms (end to end 9.18972 ms, enqueue 0.484082 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.47712 ms - Host latency: 6.32109 ms (end to end 10.879 ms, enqueue 0.482202 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.47788 ms - Host latency: 6.32166 ms (end to end 10.8823 ms, enqueue 0.481006 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48203 ms - Host latency: 6.32615 ms (end to end 10.6967 ms, enqueue 0.481055 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.46802 ms - Host latency: 6.31384 ms (end to end 9.47229 ms, enqueue 0.477344 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU 
latency: 5.4967 ms - Host latency: 6.3428 ms (end to end 8.9686 ms, enqueue 0.48147 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.49275 ms - Host latency: 6.33767 ms (end to end 9.57681 ms, enqueue 0.481714 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.52278 ms - Host latency: 6.37007 ms (end to end 10.9759 ms, enqueue 0.493896 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.49238 ms - Host latency: 6.34084 ms (end to end 10.7861 ms, enqueue 0.49917 ms)\n", - "[06/09/2021-19:51:09] [I] Average on 10 runs - GPU latency: 5.48333 ms - Host latency: 6.33235 ms (end to end 10.4963 ms, enqueue 0.500806 ms)\n", - "[06/09/2021-19:51:09] [I] Host Latency\n", - "[06/09/2021-19:51:09] [I] min: 6.28442 ms (end to end 6.327 ms)\n", - "[06/09/2021-19:51:09] [I] max: 6.66431 ms (end to end 11.2405 ms)\n", - "[06/09/2021-19:51:09] [I] mean: 6.33588 ms (end to end 10.2251 ms)\n", - "[06/09/2021-19:51:09] [I] median: 6.33411 ms (end to end 10.8945 ms)\n", - "[06/09/2021-19:51:09] [I] percentile: 6.38745 ms at 99% (end to end 11.0925 ms at 99%)\n", - "[06/09/2021-19:51:09] [I] throughput: 0 qps\n", - "[06/09/2021-19:51:09] [I] walltime: 2.99006 s\n", - "[06/09/2021-19:51:09] [I] Enqueue Time\n", - "[06/09/2021-19:51:09] [I] min: 0.413086 ms\n", - "[06/09/2021-19:51:09] [I] max: 0.796997 ms\n", - "[06/09/2021-19:51:09] [I] median: 0.486877 ms\n", - "[06/09/2021-19:51:09] [I] GPU Compute\n", - "[06/09/2021-19:51:09] [I] min: 5.4425 ms\n", - "[06/09/2021-19:51:09] [I] max: 5.82251 ms\n", - "[06/09/2021-19:51:09] [I] mean: 5.49097 ms\n", - "[06/09/2021-19:51:09] [I] median: 5.48969 ms\n", - "[06/09/2021-19:51:09] [I] percentile: 5.53986 ms at 99%\n", - "[06/09/2021-19:51:09] [I] total compute time: 2.00421 s\n", - "&&&& PASSED TensorRT.trtexec # trtexec --onnx=resnet50_onnx_model.onnx --saveEngine=resnet_engine.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16\n" - ] - } - ], - 
"source": [ - "# May need to shut down all kernels and restart before this - otherwise you might get cuDNN initialization errors:\n", - "if USE_FP16:\n", - " !trtexec --onnx=resnet50_onnx_model.onnx --saveEngine=resnet_engine.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16\n", - "else:\n", - " !trtexec --onnx=resnet50_onnx_model.onnx --saveEngine=resnet_engine.trt --explicitBatch" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "-\n", - "\n", - "__The trtexec Logs:__\n", - "\n", - "Above, trtexec does a lot of things! Some important things to note:\n", - "\n", - "__First__, _\"PASSED\"_ is what you want to see in the last line of the log above. We can see our conversion was successful!\n", - "\n", - "__Second__, can see the resnet_engine.trt engine file has indeed been successfully created: " - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "total 508284\n", - "drwxrwxr-x 8 1000 1000 4096 Jun 9 19:49 .\n", - "drwxrwxr-x 5 1000 1000 4096 Apr 5 23:28 ..\n", - "drwxr-xr-x 2 root root 4096 Apr 6 01:13 .ipynb_checkpoints\n", - "-rw-rw-r-- 1 1000 1000 34748 Jun 9 19:46 '0. Running This Guide.ipynb'\n", - "-rw-rw-r-- 1 1000 1000 502649 Apr 5 23:28 '1. Introduction.ipynb'\n", - "-rw-rw-r-- 1 1000 1000 23645 Apr 5 23:28 '2. Using the Tensorflow TensorRT Integration.ipynb'\n", - "-rw-rw-r-- 1 1000 1000 210995 Jun 9 19:49 '3. Using Tensorflow 2 through ONNX.ipynb'\n", - "-rw-rw-r-- 1 1000 1000 334050 Jun 9 19:17 '4. Using PyTorch through ONNX.ipynb'\n", - "-rw-rw-r-- 1 1000 1000 7052 Apr 5 23:28 '5. 
Understanding TensorRT Runtimes.ipynb'\n", - "drwxrwxr-x 2 1000 1000 4096 Apr 5 23:28 'Additional Examples'\n", - "drwxr-xr-x 2 root root 4096 Apr 5 23:28 Getting_Started\n", - "drwxr-xr-x 2 root root 4096 Apr 6 01:09 __pycache__\n", - "-rw-rw-r-- 1 1000 1000 4085 Apr 5 23:28 helper.py\n", - "drwxrwxr-x 2 1000 1000 4096 Apr 5 23:28 images\n", - "drwxr-xr-x 4 root root 4096 Jun 9 19:48 my_model\n", - "-rw-rw-r-- 1 1000 1000 3228 Apr 5 23:28 onnx_helper.py\n", - "-rw-r--r-- 1 root root 102169836 Jun 9 19:49 resnet50_onnx_model.onnx\n", - "-rw-r--r-- 1 root root 102470353 Apr 6 04:18 resnet50_pytorch.onnx\n", - "-rw-r--r-- 1 root root 51398352 Jun 9 19:51 resnet_engine.trt\n", - "-rw-r--r-- 1 root root 161081907 Apr 6 17:38 resnet_engine_pytorch.trt\n", - "-rw-r--r-- 1 root root 102169844 Jun 9 19:49 temp.onnx\n" - ] - } - ], - "source": [ - "!ls -la" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Third__, you can see timing details above using trtexec - these are in the ideal case with no overhead. Depending on how you run your model, a considerable amount of overhead can be added to this. We can do timing in our Python runtime below - but keep in mind performing C++ inference would likely be faster." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. What TensorRT runtime am I targeting?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We want to run our TensorRT inference in Python - so the TensorRT Python API is a great way of testing our model out in Jupyter, and is still quite performant.\n", - "\n", - "To use it, we need to do a few steps:\n", - "\n", - "__Load our engine into a tensorrt.Runtime:__" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "dX2jFwrA8qu6" - }, - "outputs": [], - "source": [ - "import tensorrt as trt\n", - "import pycuda.driver as cuda\n", - "import pycuda.autoinit\n", - "\n", - "f = open(\"resnet_engine.trt\", \"rb\")\n", - "runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING)) \n", - "\n", - "engine = runtime.deserialize_cuda_engine(f.read())\n", - "context = engine.create_execution_context()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: if this cell is having issues, restarting all Jupyter kernels and rerunning only the batch size and precision cells above before trying again often helps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Allocate input and output memory, give TRT pointers (bindings) to it:__\n", - "\n", - "d_input and d_output refer to the memory regions on our 'device' (aka GPU) - as opposed to memory on our normal RAM, where Python holds its variables (such as 'output' below)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "q3UJcdWy8qu8" - }, - "outputs": [], - "source": [ - "output = np.empty([BATCH_SIZE, 1000], dtype = target_dtype) # Need to set output dtype to FP16 to enable FP16\n", - "\n", - "# Allocate device memory\n", - "d_input = cuda.mem_alloc(1 * input_batch.nbytes)\n", - "d_output = cuda.mem_alloc(1 * output.nbytes)\n", - "\n", - "bindings = [int(d_input), int(d_output)]\n", - "\n", - "stream = cuda.Stream()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Set up prediction function:__\n", - "\n", - "This involves a copy from CPU RAM to GPU VRAM, executing the model, then copying the results back from GPU VRAM to CPU RAM:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "6R-F8JtV8qu-" - }, - "outputs": [], - "source": [ - "def predict(batch): # result gets copied into output\n", - " # Transfer input data to device\n", - " cuda.memcpy_htod_async(d_input, batch, stream)\n", - " # Execute model\n", - " context.execute_async_v2(bindings, stream.handle, None)\n", - " # Transfer predictions back\n", - " cuda.memcpy_dtoh_async(output, d_output, stream)\n", - " # Synchronize the stream\n", - " stream.synchronize()\n", - " \n", - " return output" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is all we need to run predictions using our TensorRT engine in a Python runtime!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "AdKZzW7O8qvB" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Warming up...\n", - "Done warming up!\n" - ] - } - ], - "source": [ - "print(\"Warming up...\")\n", - "\n", - "trt_predictions = predict(input_batch).astype(np.float32)\n", - "\n", - "print(\"Done warming up!\")" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Class | Probability (out of 1)\n" - ] - }, - { - "data": { - "text/plain": [ - "[(160, 0.3112793),\n", - " (169, 0.27026367),\n", - " (212, 0.17321777),\n", - " (170, 0.07165527),\n", - " (207, 0.033843994)]" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "indices = (-trt_predictions[0]).argsort()[:5]\n", - "print(\"Class | Probability (out of 1)\")\n", - "list(zip(indices, trt_predictions[0][indices]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that we have recovered the same predictions as before!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Performance Comparison:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we can see how quickly we can feed a single batch to TensorRT, which we can compare to our original Tensorflow experiment from earlier." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We use the %%timeit Jupyter magic again. Note that %%timeit is fairly rough, and for any actual benchmarking better controlled testing is required - preferably outside of Jupyter." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "id": "XAtWnCK38qvD" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "6.41 ms ± 846 ns per loop (mean ± std. dev. 
of 7 runs, 100 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "_ = predict(input_batch) # Check TRT performance" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Next Steps:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

Profiling

\n", - "\n", - "This is a great next step for further optimizing and debugging models you are working on productionizing\n", - "\n", - "You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html\n", - "\n", - "

TRT Dev Docs

\n", - "\n", - "Main documentation page for the ONNX, layer builder, C++, and legacy APIs\n", - "\n", - "You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html\n", - "\n", - "

TRT OSS GitHub

\n", - "\n", - "Contains OSS TRT components, sample applications, and plugin examples\n", - "\n", - "You can find it here: https://github.com/NVIDIA/TensorRT\n", - "\n", - "\n", - "#### TRT Supported Layers:\n", - "\n", - "https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#layers-precision-matrix\n", - "\n", - "#### TRT ONNX Plugin Example:\n", - "\n", - "https://github.com/NVIDIA/TensorRT/tree/main/samples/opensource/samplePlugin\n" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "name": "ONNXExample.ipynb", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/quickstart/IntroNotebooks/4. Using PyTorch through ONNX.ipynb b/quickstart/IntroNotebooks/4. Using PyTorch through ONNX.ipynb deleted file mode 100644 index b90f9d49..00000000 --- a/quickstart/IntroNotebooks/4. Using PyTorch through ONNX.ipynb +++ /dev/null @@ -1,992 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using PyTorch with TensorRT through ONNX:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "TensorRT is a great way to take a trained PyTorch model and optimize it to run more efficiently during inference on an NVIDIA GPU.\n", - "\n", - "One approach to convert a PyTorch model to TensorRT is to export it to ONNX (an open exchange format for deep learning models) and then convert it into a TensorRT engine. 
Essentially, we will follow this path to convert and deploy our model:\n", - "\n", - "![PyTorch+ONNX](./images/pytorch_onnx.png)\n", - "\n", - "Both TensorFlow and PyTorch models can be exported to ONNX, as well as many other frameworks. This allows models created using either framework to flow into common downstream pipelines.\n", - "\n", - "To get started, let's take a well-known computer vision model and follow five key steps to deploy it to the TensorRT Python runtime:\n", - "\n", - "1. __What format should I save my model in?__\n", - "2. __What batch size(s) am I running inference at?__\n", - "3. __What precision am I running inference at?__\n", - "4. __What TensorRT path am I using to convert my model?__\n", - "5. __What runtime am I targeting?__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. What format should I save my model in?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We are going to use ResNet50, a widely used CNN architecture first described in this paper.\n", - "\n", - "Let's start by loading dependencies and downloading the model. We will also move our Resnet model onto the GPU and set it to evaluation mode." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Downloading: \"https://download.pytorch.org/models/resnet50-19c8e357.pth\" to /root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth\n" - ] - } - ], - "source": [ - "import torchvision.models as models\n", - "import torch\n", - "import torch.onnx\n", - "\n", - "# load the pretrained model\n", - "resnet50 = models.resnet50(pretrained=True, progress=False).eval()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When saving a model to ONNX, PyTorch requires a test batch in proper shape and format. 
We pick a batch size:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "BATCH_SIZE=32\n", - "\n", - "dummy_input=torch.randn(BATCH_SIZE, 3, 224, 224)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we will export the model using the dummy input batch:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "# export the model to ONNX\n", - "torch.onnx.export(resnet50, dummy_input, \"resnet50_pytorch.onnx\", verbose=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that we are picking a BATCH_SIZE of 32 in this example." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Now Test with a Real Image:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's try a real image batch! For this example, we will simply repeat one open-source dog image from http://www.dog.ceo:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(32, 224, 224, 3)" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from skimage import io\n", - "from skimage.transform import resize\n", - "from matplotlib import pyplot as plt\n", - "import numpy as np\n", - "\n", - "url='https://images.dog.ceo/breeds/retriever-golden/n02099601_3004.jpg'\n", - "img = resize(io.imread(url), (224, 224))\n", - "img = np.expand_dims(np.array(img, dtype=np.float32), axis=0) # Expand image to have a batch dimension\n", - "input_batch = np.array(np.repeat(img, BATCH_SIZE, axis=0), dtype=np.float32) # Repeat across the batch dimension\n", - "\n", - "input_batch.shape" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 5, - "metadata": 
{}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plt.imshow(input_batch[0].astype(np.float32))" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "resnet50_gpu = models.resnet50(pretrained=True, progress=False).to(\"cuda\").eval()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We need to move our batch onto GPU and properly format it to shape [32, 3, 224, 224]. " - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "torch.Size([32, 3, 224, 224])" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "input_batch_chw = torch.from_numpy(input_batch).transpose(1,3).transpose(2,3)\n", - "input_batch_gpu = input_batch_chw.to(\"cuda\")\n", - "\n", - "input_batch_gpu.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can run a prediction on a batch using .forward():" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(32, 1000)" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "with torch.no_grad():\n", - " predictions = np.array(resnet50_gpu(input_batch_gpu).cpu())\n", - "\n", - "predictions.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Verify Baseline Model Performance/Accuracy:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For a baseline, lets time our prediction in FP32:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "31.5 ms ± 72.1 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "with torch.no_grad():\n", - " preds = np.array(resnet50_gpu(input_batch_gpu).cpu())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can also time FP16 precision performance:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(32, 1000)" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "resnet50_gpu_half = resnet50_gpu.half()\n", - "input_half = input_batch_gpu.half()\n", - "\n", - "with torch.no_grad():\n", - " preds = np.array(resnet50_gpu_half(input_half).cpu()) # Warm Up\n", - " \n", - "preds.shape" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "19.4 ms ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "with torch.no_grad():\n", - " preds = np.array(resnet50_gpu_half(input_half).cpu())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's also make sure our results are accurate. We will look at the top 5 accuracy on a single image prediction. The image we are using is of a Golden Retriever, which is class 207 in the ImageNet dataset our model was trained on." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Class | Likelihood\n" - ] - }, - { - "data": { - "text/plain": [ - "[(207, 13.121688),\n", - " (208, 9.614037),\n", - " (257, 9.361297),\n", - " (205, 8.777787),\n", - " (160, 8.557351)]" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "indices = (-predictions[0]).argsort()[:5]\n", - "print(\"Class | Likelihood\")\n", - "list(zip(indices, predictions[0][indices]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We have a model exported to ONNX and a baseline to compare against! Let's now take our ONNX model and convert it to a TensorRT inference engine." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, let's restart our Jupyter Kernel so PyTorch doesn't collide with TensorRT: " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "os._exit(0) # Exit this kernel process so TRT doesn't fight with PyTorch for GPU memory" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. What batch size(s) am I running inference at?\n", - "\n", - "We are going to run with a fixed batch size of 32 for this example. Note that above we set BATCH_SIZE to 32 when saving our model to ONNX. We need to create another dummy batch of the same size (this time it will need to be in our target precision) to test out our engine.\n", - "\n", - "First, as before, we will set our BATCH_SIZE to 32. Note that our trtexec command below includes the '--explicitBatch' flag to signal to TensorRT that we will be using a fixed batch size at runtime." 
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "BATCH_SIZE = 32" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Importantly, by default TensorRT will use the input precision you give the runtime as the default precision for the rest of the network. So before we create our new dummy batch, we also need to choose a precision as in the next section:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. What precision am I running inference at?\n", - "\n", - "Remember that lower precisions than FP32 tend to run faster. There are two common reduced precision modes - FP16 and INT8. Graphics cards that are designed to do inference well often have an affinity for one of these two types. This guide was developed on an NVIDIA V100, which favors FP16, so we will use that here by default. INT8 is a more complicated process that requires a calibration step.\n", - "\n", - "__NOTE__: Make sure the precision you set here (USE_FP16) matches the one you used above!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "USE_FP16 = True\n", - "target_dtype = np.float16 if USE_FP16 else np.float32" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " To create a test batch, we will once again repeat one open-source dog image from http://www.dog.ceo:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(32, 224, 224, 3)" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from skimage import io\n", - "from skimage.transform import resize\n", - "from matplotlib import pyplot as plt\n", - "import numpy as np\n", - "\n", - "url='https://images.dog.ceo/breeds/retriever-golden/n02099601_3004.jpg'\n", - "img = resize(io.imread(url), (224, 224))\n", - "input_batch = np.array(np.repeat(np.expand_dims(np.array(img, dtype=np.float32), axis=0), BATCH_SIZE, axis=0), dtype=np.float32)\n", - "\n", - "input_batch.shape" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plt.imshow(input_batch[0].astype(np.float32))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Preprocess Images:\n", - "\n", - "PyTorch has a normalization that it applies by default in all of its pretrained vision models - we can preprocess our images to match this normalization by the following, making sure our final result is in FP16 precision:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "from torchvision.transforms import Normalize\n", - "\n", - "def preprocess_image(img):\n", - " norm = Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])\n", - " result = norm(torch.from_numpy(img).transpose(0,2).transpose(1,2))\n", - " return np.array(result, dtype=np.float16)\n", - "\n", - "preprocessed_images = np.array([preprocess_image(image) for image in input_batch])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. What TensorRT path am I using to convert my model?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can use trtexec, a command line tool for working with TensorRT, in order to convert an ONNX model originally from PyTorch to an engine file.\n", - "\n", - "Let's make sure we have TensorRT installed (this comes with trtexec):" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "import tensorrt" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To convert the model we saved in the previous step, we need to point to the ONNX file, give trtexec a name to save the engine as, and last specify that we want to use a fixed batch size instead of a dynamic one." 
- ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "&&&& RUNNING TensorRT.trtexec # trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16\n", - "[06/09/2021-20:23:03] [I] === Model Options ===\n", - "[06/09/2021-20:23:03] [I] Format: ONNX\n", - "[06/09/2021-20:23:03] [I] Model: resnet50_pytorch.onnx\n", - "[06/09/2021-20:23:03] [I] Output:\n", - "[06/09/2021-20:23:03] [I] === Build Options ===\n", - "[06/09/2021-20:23:03] [I] Max batch: explicit\n", - "[06/09/2021-20:23:03] [I] Workspace: 16 MiB\n", - "[06/09/2021-20:23:03] [I] minTiming: 1\n", - "[06/09/2021-20:23:03] [I] avgTiming: 8\n", - "[06/09/2021-20:23:03] [I] Precision: FP32+FP16\n", - "[06/09/2021-20:23:03] [I] Calibration: \n", - "[06/09/2021-20:23:03] [I] Refit: Disabled\n", - "[06/09/2021-20:23:03] [I] Safe mode: Disabled\n", - "[06/09/2021-20:23:03] [I] Save engine: resnet_engine_pytorch.trt\n", - "[06/09/2021-20:23:03] [I] Load engine: \n", - "[06/09/2021-20:23:03] [I] Builder Cache: Enabled\n", - "[06/09/2021-20:23:03] [I] NVTX verbosity: 0\n", - "[06/09/2021-20:23:03] [I] Tactic sources: Using default tactic sources\n", - "[06/09/2021-20:23:03] [I] Input(s): fp16:chw\n", - "[06/09/2021-20:23:03] [I] Output(s): fp16:chw\n", - "[06/09/2021-20:23:03] [I] Input build shapes: model\n", - "[06/09/2021-20:23:03] [I] Input calibration shapes: model\n", - "[06/09/2021-20:23:03] [I] === System Options ===\n", - "[06/09/2021-20:23:03] [I] Device: 0\n", - "[06/09/2021-20:23:03] [I] DLACore: \n", - "[06/09/2021-20:23:03] [I] Plugins:\n", - "[06/09/2021-20:23:03] [I] === Inference Options ===\n", - "[06/09/2021-20:23:03] [I] Batch: Explicit\n", - "[06/09/2021-20:23:03] [I] Input inference shapes: model\n", - "[06/09/2021-20:23:03] [I] Iterations: 10\n", - "[06/09/2021-20:23:03] [I] Duration: 3s (+ 200ms warm 
up)\n", - "[06/09/2021-20:23:03] [I] Sleep time: 0ms\n", - "[06/09/2021-20:23:03] [I] Streams: 1\n", - "[06/09/2021-20:23:03] [I] ExposeDMA: Disabled\n", - "[06/09/2021-20:23:03] [I] Data transfers: Enabled\n", - "[06/09/2021-20:23:03] [I] Spin-wait: Disabled\n", - "[06/09/2021-20:23:03] [I] Multithreading: Disabled\n", - "[06/09/2021-20:23:03] [I] CUDA Graph: Disabled\n", - "[06/09/2021-20:23:03] [I] Separate profiling: Disabled\n", - "[06/09/2021-20:23:03] [I] Skip inference: Disabled\n", - "[06/09/2021-20:23:03] [I] Inputs:\n", - "[06/09/2021-20:23:03] [I] === Reporting Options ===\n", - "[06/09/2021-20:23:03] [I] Verbose: Disabled\n", - "[06/09/2021-20:23:03] [I] Averages: 10 inferences\n", - "[06/09/2021-20:23:03] [I] Percentile: 99\n", - "[06/09/2021-20:23:03] [I] Dump refittable layers:Disabled\n", - "[06/09/2021-20:23:03] [I] Dump output: Disabled\n", - "[06/09/2021-20:23:03] [I] Profile: Disabled\n", - "[06/09/2021-20:23:03] [I] Export timing to JSON file: \n", - "[06/09/2021-20:23:03] [I] Export output to JSON file: \n", - "[06/09/2021-20:23:03] [I] Export profile to JSON file: \n", - "[06/09/2021-20:23:03] [I] \n", - "[06/09/2021-20:23:04] [I] === Device Information ===\n", - "[06/09/2021-20:23:04] [I] Selected Device: Tesla V100-DGXS-16GB\n", - "[06/09/2021-20:23:04] [I] Compute Capability: 7.0\n", - "[06/09/2021-20:23:04] [I] SMs: 80\n", - "[06/09/2021-20:23:04] [I] Compute Clock Rate: 1.53 GHz\n", - "[06/09/2021-20:23:04] [I] Device Global Memory: 16155 MiB\n", - "[06/09/2021-20:23:04] [I] Shared Memory per SM: 96 KiB\n", - "[06/09/2021-20:23:04] [I] Memory Bus Width: 4096 bits (ECC enabled)\n", - "[06/09/2021-20:23:04] [I] Memory Clock Rate: 0.877 GHz\n", - "[06/09/2021-20:23:04] [I] \n", - "[06/09/2021-20:23:20] [I] [TRT] ----------------------------------------------------------------\n", - "[06/09/2021-20:23:20] [I] [TRT] Input filename: resnet50_pytorch.onnx\n", - "[06/09/2021-20:23:20] [I] [TRT] ONNX IR version: 0.0.6\n", - 
"[06/09/2021-20:23:20] [I] [TRT] Opset version: 9\n", - "[06/09/2021-20:23:20] [I] [TRT] Producer name: pytorch\n", - "[06/09/2021-20:23:20] [I] [TRT] Producer version: 1.9\n", - "[06/09/2021-20:23:20] [I] [TRT] Domain: \n", - "[06/09/2021-20:23:20] [I] [TRT] Model version: 0\n", - "[06/09/2021-20:23:20] [I] [TRT] Doc string: \n", - "[06/09/2021-20:23:20] [I] [TRT] ----------------------------------------------------------------\n", - "[06/09/2021-20:23:24] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.\n", - "[06/09/2021-20:24:49] [I] [TRT] Detected 1 inputs and 1 output network tensors.\n", - "[06/09/2021-20:24:49] [I] Engine built in 105.672 sec.\n", - "[06/09/2021-20:24:50] [I] Starting inference\n", - "[06/09/2021-20:24:53] [I] Warmup completed 0 queries over 200 ms\n", - "[06/09/2021-20:24:53] [I] Timing trace has 0 queries over 2.9909 s\n", - "[06/09/2021-20:24:53] [I] Trace averages of 10 runs:\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.35326 ms - Host latency: 6.18286 ms (end to end 10.1932 ms, enqueue 0.460231 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.35654 ms - Host latency: 6.19131 ms (end to end 10.2018 ms, enqueue 0.473865 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.38982 ms - Host latency: 6.22551 ms (end to end 10.2071 ms, enqueue 0.460098 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.3761 ms - Host latency: 6.24244 ms (end to end 10.2638 ms, enqueue 0.456512 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.36218 ms - Host latency: 6.22775 ms (end to end 9.37773 ms, enqueue 0.441846 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.35991 ms - Host latency: 6.22073 ms (end to end 9.77996 ms, enqueue 0.443829 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.38082 ms - 
Host latency: 6.25148 ms (end to end 10.0299 ms, enqueue 0.44693 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.39341 ms - Host latency: 6.26748 ms (end to end 10.0738 ms, enqueue 0.456384 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.38766 ms - Host latency: 6.26089 ms (end to end 10.2009 ms, enqueue 0.461377 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.37385 ms - Host latency: 6.24359 ms (end to end 9.65547 ms, enqueue 0.442078 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.35819 ms - Host latency: 6.21615 ms (end to end 8.21369 ms, enqueue 0.436646 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.34844 ms - Host latency: 6.20999 ms (end to end 9.77367 ms, enqueue 0.433765 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.35132 ms - Host latency: 6.21758 ms (end to end 10.6213 ms, enqueue 0.435864 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.36421 ms - Host latency: 6.23065 ms (end to end 10.5457 ms, enqueue 0.436438 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.39054 ms - Host latency: 6.25834 ms (end to end 10.4534 ms, enqueue 0.444727 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.36874 ms - Host latency: 6.23105 ms (end to end 8.89895 ms, enqueue 0.443665 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.35729 ms - Host latency: 6.21859 ms (end to end 8.51741 ms, enqueue 0.437866 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.33851 ms - Host latency: 6.19753 ms (end to end 9.1334 ms, enqueue 0.438574 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.34199 ms - Host latency: 6.21041 ms (end to end 10.6064 ms, enqueue 0.44613 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.33002 ms - Host latency: 6.20233 ms (end to end 10.5858 ms, 
enqueue 0.458911 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.38256 ms - Host latency: 6.25411 ms (end to end 9.77722 ms, enqueue 0.460205 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.3837 ms - Host latency: 6.2543 ms (end to end 9.4882 ms, enqueue 0.448364 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.35146 ms - Host latency: 6.20986 ms (end to end 8.36691 ms, enqueue 0.434412 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.34351 ms - Host latency: 6.20732 ms (end to end 10.1922 ms, enqueue 0.439209 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.3502 ms - Host latency: 6.21951 ms (end to end 10.6236 ms, enqueue 0.451489 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.34368 ms - Host latency: 6.21904 ms (end to end 10.4949 ms, enqueue 0.462231 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.33777 ms - Host latency: 6.21189 ms (end to end 9.99021 ms, enqueue 0.455859 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.33193 ms - Host latency: 6.19707 ms (end to end 9.02058 ms, enqueue 0.445972 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.33115 ms - Host latency: 6.19114 ms (end to end 9.11257 ms, enqueue 0.433862 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.34673 ms - Host latency: 6.21465 ms (end to end 10.6074 ms, enqueue 0.442139 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.38572 ms - Host latency: 6.25532 ms (end to end 10.3253 ms, enqueue 0.446631 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.36335 ms - Host latency: 6.23845 ms (end to end 10.6406 ms, enqueue 0.45625 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.36877 ms - Host latency: 6.24153 ms (end to end 10.2023 ms, enqueue 0.449341 ms)\n", - "[06/09/2021-20:24:53] [I] 
Average on 10 runs - GPU latency: 5.36023 ms - Host latency: 6.21748 ms (end to end 8.45557 ms, enqueue 0.436719 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.34392 ms - Host latency: 6.20728 ms (end to end 10.1899 ms, enqueue 0.438428 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.34636 ms - Host latency: 6.21821 ms (end to end 10.6184 ms, enqueue 0.447217 ms)\n", - "[06/09/2021-20:24:53] [I] Average on 10 runs - GPU latency: 5.33555 ms - Host latency: 6.20952 ms (end to end 10.5899 ms, enqueue 0.459546 ms)\n", - "[06/09/2021-20:24:53] [I] Host Latency\n", - "[06/09/2021-20:24:53] [I] min: 6.16092 ms (end to end 6.17383 ms)\n", - "[06/09/2021-20:24:53] [I] max: 6.2887 ms (end to end 10.8184 ms)\n", - "[06/09/2021-20:24:53] [I] mean: 6.22352 ms (end to end 9.90214 ms)\n", - "[06/09/2021-20:24:53] [I] median: 6.22021 ms (end to end 10.6108 ms)\n", - "[06/09/2021-20:24:53] [I] percentile: 6.28583 ms at 99% (end to end 10.7902 ms at 99%)\n", - "[06/09/2021-20:24:53] [I] throughput: 0 qps\n", - "[06/09/2021-20:24:53] [I] walltime: 2.9909 s\n", - "[06/09/2021-20:24:53] [I] Enqueue Time\n", - "[06/09/2021-20:24:53] [I] min: 0.424072 ms\n", - "[06/09/2021-20:24:53] [I] max: 0.49585 ms\n", - "[06/09/2021-20:24:53] [I] median: 0.445618 ms\n", - "[06/09/2021-20:24:53] [I] GPU Compute\n", - "[06/09/2021-20:24:53] [I] min: 5.30127 ms\n", - "[06/09/2021-20:24:53] [I] max: 5.42108 ms\n", - "[06/09/2021-20:24:53] [I] mean: 5.35895 ms\n", - "[06/09/2021-20:24:53] [I] median: 5.35571 ms\n", - "[06/09/2021-20:24:53] [I] percentile: 5.41693 ms at 99%\n", - "[06/09/2021-20:24:53] [I] total compute time: 2.00961 s\n", - "&&&& PASSED TensorRT.trtexec # trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16\n" - ] - } - ], - "source": [ - "# step out of Python for a moment to convert the ONNX model to a TRT engine using trtexec\n", - "if 
USE_FP16:\n", - " !trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16\n", - "else:\n", - " !trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt --explicitBatch" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This will save our model as 'resnet_engine_pytorch.trt'." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. What TensorRT runtime am I targeting?\n", - "\n", - "Now we have converted our model to a TensorRT engine. Great! That means we are ready to load it into the native Python TensorRT runtime. This runtime strikes a balance between the ease of use of the high level Python APIs used in frameworks and the fast, low level C++ runtimes available in TensorRT." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 15.9 s, sys: 556 ms, total: 16.5 s\n", - "Wall time: 19.3 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "import tensorrt as trt\n", - "import pycuda.driver as cuda\n", - "import pycuda.autoinit\n", - "\n", - "f = open(\"resnet_engine_pytorch.trt\", \"rb\")\n", - "runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING)) \n", - "\n", - "engine = runtime.deserialize_cuda_engine(f.read())\n", - "context = engine.create_execution_context()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now allocate input and output memory, and give TRT pointers (bindings) to it:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "# need to set input and output precisions to FP16 to fully enable it\n", - "output = np.empty([BATCH_SIZE, 1000], dtype = target_dtype) \n", - "\n", - "# allocate device memory\n", - "d_input = cuda.mem_alloc(1 * 
input_batch.nbytes)\n", - "d_output = cuda.mem_alloc(1 * output.nbytes)\n", - "\n", - "bindings = [int(d_input), int(d_output)]\n", - "\n", - "stream = cuda.Stream()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, set up the prediction function.\n", - "\n", - "This involves a copy from CPU RAM to GPU VRAM, executing the model, then copying the results back from GPU VRAM to CPU RAM:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "def predict(batch): # result gets copied into output\n", - " # transfer input data to device\n", - " cuda.memcpy_htod_async(d_input, batch, stream)\n", - " # execute model\n", - " context.execute_async_v2(bindings, stream.handle, None)\n", - " # transfer predictions back\n", - " cuda.memcpy_dtoh_async(output, d_output, stream)\n", - " # synchronize the stream\n", - " stream.synchronize()\n", - " \n", - " return output" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's time the function!" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Warming up...\n", - "Done warming up!\n" - ] - } - ], - "source": [ - "print(\"Warming up...\")\n", - "\n", - "pred = predict(preprocessed_images)\n", - "\n", - "print(\"Done warming up!\")" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "6.28 ms ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "pred = predict(preprocessed_images)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we should verify our TensorRT output is still accurate." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Class | Probability (out of 1)\n" - ] - }, - { - "data": { - "text/plain": [ - "[(207, 12.44), (208, 7.508), (220, 7.492), (160, 7.426), (226, 7.383)]" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "indices = (-pred[0]).argsort()[:5]\n", - "print(\"Class | Probability (out of 1)\")\n", - "list(zip(indices, pred[0][indices]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Look for ImageNet indices 150-275 above, where 207 is the ground truth correct class (Golden Retriever). Compare with the results of the original unoptimized model in the first section!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Next Steps:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

Profiling

\n", - "\n", - "This is a great next step for further optimizing and debugging models you are working on productionizing\n", - "\n", - "You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html\n", - "\n", - "

TRT Dev Docs

\n", - "\n", - "Main documentation page for the ONNX, layer builder, C++, and legacy APIs\n", - "\n", - "You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html\n", - "\n", - "

TRT OSS GitHub

\n", - "\n", - "Contains OSS TRT components, sample applications, and plugin examples\n", - "\n", - "You can find it here: https://github.com/NVIDIA/TensorRT\n", - "\n", - "\n", - "#### TRT Supported Layers:\n", - "\n", - "https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#layers-precision-matrix\n", - "\n", - "#### TRT ONNX Plugin Example:\n", - "\n", - "https://github.com/NVIDIA/TensorRT/tree/main/samples/opensource/samplePlugin" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.8" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/quickstart/IntroNotebooks/5. Understanding TensorRT Runtimes.ipynb b/quickstart/IntroNotebooks/5. Understanding TensorRT Runtimes.ipynb deleted file mode 100644 index 05e8b3df..00000000 --- a/quickstart/IntroNotebooks/5. Understanding TensorRT Runtimes.ipynb +++ /dev/null @@ -1,107 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Runtimes: What are my options? How do I choose?\n", - "\n", - "Remember that TensorRT consists of two main components - __1. A series of parsers and integrations__ to convert your model to an optimized engine and __2. A series of TensorRT runtime APIs__ with several associated tools for deployment.\n", - "\n", - "In this notebook, we will focus on the latter - various runtime options for TensorRT engines.\n", - "\n", - "The runtimes have different use cases for running TRT engines. 
\n", - "\n", - "### Considerations when picking a runtime:\n", - "\n", - "Generally speaking, there are a few major considerations when picking a runtime:\n", - "- __Framework__ - Some options, like TF-TRT, are only relevant to Tensorflow\n", - "- __Time-to-solution__ - TF-TRT is much more likely to work 'out-of-the-box' if a quick solution is required and ONNX fails\n", - "- __Serving needs__ - TF-TRT can use TF Serving to serve models over HTTP as a simple solution. For other frameworks (or for more advanced features) TRITON is framework agnostic, allows for concurrent model execution or multiple copies within a GPU to reduce latency, and can accept engines created through both the ONNX and TF-TRT paths\n", - "- __Performance__ - Different TensorRT runtimes offer varying levels of performance. For example, TF-TRT is generally going to be slower than using ONNX or the C++ API directly.\n", - "\n", - "### Python API:\n", - "\n", - "__Use this when:__\n", - "- You can accept some performance overhead, and\n", - "- You are most familiar with Python, or\n", - "- You are performing initial debugging and testing with TRT\n", - "\n", - "__More info:__\n", - "\n", - " \n", - "The [TensorRT Python API](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#perform_inference_python) gives you fine-grained control over the execution of your engine using a Python interface. It makes memory allocation, kernel execution, and copies to and from the GPU explicit - which can make integration into high performance applications easier. 
It is also great for testing models in a Python environment - such as in a Jupyter notebook.\n", - " \n", - "The [ONNX notebook for Tensorflow](./3.%20Using%20Tensorflow%202%20through%20ONNX.ipynb) and [for PyTorch](./4.%20Using%20PyTorch%20through%20ONNX.ipynb) are good examples of using TensorRT to get great performance while staying in Python\n", - "\n", - "### C++ API: \n", - "\n", - "__Use this when:__\n", - "- You want the least amount of overhead possible to maximize the performance of your models and achieve better latency\n", - "- You are not using TF-TRT (though TF-TRT graph conversions that only generate a single engine can still be exported to C++)\n", - "- You are most familiar with C++\n", - "- You want to optimize your inference pipeline as much as possible\n", - "\n", - "__More info:__\n", - "\n", - "The [TensorRT C++ API](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#perform_inference_c) gives you fine-grained control over the execution of your engine using a C++ interface. It makes memory allocation, kernel execution, and copies to and from the GPU explicit - which can make integration into high performance C++ applications easier. The C++ API is generally the most performant option for running TensorRT engines, with the least overhead.\n", - "\n", - "[This NVIDIA Developer blog](https://developer.nvidia.com/blog/speed-up-inference-tensorrt/) is a good example of taking an ONNX model and running it with dynamic batch size support using the C++ API.\n", - "\n", - "\n", - "### Tensorflow/TF-TRT Runtime: (Tensorflow Only) \n", - " \n", - "__Use this when:__\n", - " \n", - "- You are using TF-TRT, and\n", - "- Your model converts to more than one TensorRT engine\n", - "\n", - "__More info:__\n", - "\n", - "\n", - "TF-TRT is the standard runtime used with models that were converted in TF-TRT. 
It works by taking groups of nodes at once in the Tensorflow graph, and replacing them with a single optimized engine that calls the TensorRT Python API behind the scenes. This optimized engine is in the form of a Tensorflow operation - which means that your graph is still in Tensorflow and will essentially function like any other Tensorflow model. For example, it can be a useful exercise to take a look at your model in Tensorboard to validate which nodes TensorRT was able to optimize.\n", - "\n", - "If your graph entirely converts to a single TF-TRT engine, it can be more efficient to export the engine node and run it using one of the other APIs. You can find instructions to do this in the [TF-TRT documentation](https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tensorrt-plan).\n", - "\n", - "As an example, the TF-TRT notebooks included with this guide use the TF-TRT runtime.\n", - "\n", - "### TRITON Inference Server\n", - "\n", - "__Use this when:__\n", - "- You want to serve your models over HTTP or gRPC\n", - "- You want to load balance across multiple models or copies of models across GPUs to minimize latency and make better use of the GPU\n", - "- You want to have multiple models running efficiently on a single GPU at the same time\n", - "- You want to serve a variety of models converted using a variety of converters and frameworks (including TF-TRT and ONNX) through a uniform interface\n", - "- You need serving support but are using PyTorch, another framework, or the ONNX path in general\n", - "\n", - "__More info:__\n", - "\n", - "\n", - "TRITON is open source inference serving software that lets teams deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework), from local storage or Google Cloud Platform or AWS S3 on any GPU- or CPU-based infrastructure (cloud, data center, or edge). 
It is a flexible project with several unique features - such as concurrent model execution of both heterogeneous models and multiple copies of the same model (multiple model copies can reduce latency further) as well as load balancing and model analysis. It is a good option if you need to serve your models over HTTP - such as in a cloud inferencing solution.\n", - " \n", - "You can find the TRITON home page [here](https://developer.nvidia.com/nvidia-triton-inference-server), and the documentation [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/)." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/quickstart/IntroNotebooks/Additional Examples/1. TF-TRT Classification.ipynb b/quickstart/IntroNotebooks/Additional Examples/1. TF-TRT Classification.ipynb deleted file mode 100644 index 4fa0bab1..00000000 --- a/quickstart/IntroNotebooks/Additional Examples/1. TF-TRT Classification.ipynb +++ /dev/null @@ -1,711 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# TF-TRT Keras Classification Examples:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this notebook, we cover a variety of classification base networks pulled from the tensorflow.keras.applications project!\n", - "\n", - "This demonstrates TF-TRT working on a variety of model architectures out of the box. This is a great way to demonstrate the ease of use of TF-TRT. TF-TRT can still optimize parts of your network even if it contains layers that are not supported by TensorRT itself. 
This makes it easy to get a first-pass at an optimized model - as we will demonstrate here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's make sure our GPUs are properly configured and visible:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Fri Jan 29 22:55:18 2021 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.1 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. |\n", - "|===============================+======================+======================|\n", - "| 0 Tesla V100-DGXS... On | 00000000:07:00.0 Off | 0 |\n", - "| N/A 43C P0 62W / 300W | 125MiB / 16155MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 Tesla V100-DGXS... On | 00000000:08:00.0 Off | 0 |\n", - "| N/A 42C P0 38W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 2 Tesla V100-DGXS... On | 00000000:0E:00.0 Off | 0 |\n", - "| N/A 41C P0 38W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 3 Tesla V100-DGXS... 
On | 00000000:0F:00.0 Off | 0 |\n", - "| N/A 42C P0 37W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "+-----------------------------------------------------------------------------+\n" - ] - } - ], - "source": [ - "!nvidia-smi" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Remember that to successfully deploy a TensorRT model, you have to make __five key decisions__:\n", - "\n", - "1. __What format should I save my model in?__\n", - "2. __What batch size(s) am I running inference at?__\n", - "3. __What precision am I running inference at?__\n", - "4. __What TensorRT path am I using to convert my model?__\n", - "5. __What runtime am I targeting?__\n", - "\n", - "Let's get to it!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. What format should I save my model in?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "TF-TRT requires SavedModel format in Tensorflow 2.x:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "!mkdir -p tmp_savedmodels" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "downloading and initializing models...\n", - "Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5\n", - "553467904/553467096 [==============================] - 73s 0us/step\n", - "Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels.h5\n", - "96116736/96112376 [==============================] - 3s 0us/step\n", - "Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/xception/xception_weights_tf_dim_ordering_tf_kernels.h5\n", - "91889664/91884032 [==============================] - 5s 0us/step\n", - "Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5\n", - "14540800/14536120 [==============================] - 1s 0us/step\n", - "Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/densenet/densenet121_weights_tf_dim_ordering_tf_kernels.h5\n", - "33193984/33188688 [==============================] - 1s 0us/step\n", - "Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50v2_weights_tf_dim_ordering_tf_kernels.h5\n", - "102875136/102869336 [==============================] - 10s 0us/step\n", - "saving ...\n", - "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from 
tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "This property should not be used in TensorFlow 2.0, as updates are applied automatically.\n", - "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "This property should not be used in TensorFlow 2.0, as updates are applied automatically.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/0/assets\n", - "finished saving!\n", - "saving ...\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/1/assets\n", - "finished saving!\n", - "saving ...\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/2/assets\n", - "finished saving!\n", - "saving ...\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/3/assets\n", - "finished saving!\n", - "saving ...\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/4/assets\n", - "finished saving!\n", - "saving ...\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/5/assets\n", - "finished saving!\n", - "saving ...\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/6/assets\n", - "finished saving!\n" - ] - } - ], - "source": [ - "from tensorflow.keras.applications import ResNet50, VGG16, InceptionV3, Xception, MobileNetV2, DenseNet121, ResNet50V2\n", - "\n", - "print(\"Downloading and initializing models...\")\n", - "models = [ResNet50, VGG16, InceptionV3, Xception, MobileNetV2, DenseNet121, ResNet50V2]\n", - "models = [model(include_top=True, weights='imagenet') for model in models]\n", - "\n", - "model_dirs = []\n", - "for idx, model in enumerate(models):\n", - " print(\"Saving\", model,\"...\")\n", - " model_dir = 'tmp_savedmodels/%s' % idx\n", - " model_dirs.append(model_dir)\n", - " model.save(model_dir) \n", - " print(\"Finished 
saving!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. What batch size(s) am I running inference at?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will use a batch size of 32 for all models:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "BATCH_SIZE = 32" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We create a series of zero-filled \"dummy\" batches to test our models on:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "dummy_input_batch = lambda x: np.zeros((BATCH_SIZE, x, x, 3))\n", - "\n", - "dummy_inputs = [224, 224, 299, 299, 224, 224, 224]\n", - "dummy_inputs = [dummy_input_batch(size) for size in dummy_inputs]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Last, we \"warm up\" all of our models so their one-time start-up costs don't throw off any of our Jupyter %%time magic timings:" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:5 out of the last 5 calls to .predict_function at 0x7f59d856e488> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. 
For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.\n", - "WARNING:tensorflow:6 out of the last 6 calls to .predict_function at 0x7f5b3820d6a8> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.\n", - "WARNING:tensorflow:7 out of the last 7 calls to .predict_function at 0x7f596ca93950> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.\n" - ] - } - ], - "source": [ - "# Warm up:\n", - "for idx, model in enumerate(models):\n", - " model.predict(dummy_inputs[idx])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. What precision am I running inference at?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will leave it as the default:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "PRECISION = \"FP32\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. What TensorRT tool or integration am I using to convert my model?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will be using TF-TRT through the ModelOptimizer example wrapper used in this guide:" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Starting resnet50 tmp_savedmodels/0\n", - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/0_FP32/assets\n", - "[[1.6964252e-04 3.3007402e-04 6.1350249e-05 ... 1.4622317e-05\n", - " 1.4449877e-04 6.6086568e-04]\n", - " [1.6964252e-04 3.3007402e-04 6.1350249e-05 ... 1.4622317e-05\n", - " 1.4449877e-04 6.6086568e-04]\n", - " [1.6964252e-04 3.3007402e-04 6.1350249e-05 ... 1.4622317e-05\n", - " 1.4449877e-04 6.6086568e-04]\n", - " ...\n", - " [1.6964252e-04 3.3007402e-04 6.1350249e-05 ... 1.4622317e-05\n", - " 1.4449877e-04 6.6086568e-04]\n", - " [1.6964252e-04 3.3007402e-04 6.1350249e-05 ... 1.4622317e-05\n", - " 1.4449877e-04 6.6086568e-04]\n", - " [1.6964252e-04 3.3007402e-04 6.1350249e-05 ... 
1.4622317e-05\n", - " 1.4449877e-04 6.6086568e-04]]\n", - "Finished!\n", - "\n", - "Starting vgg16 tmp_savedmodels/1\n", - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_1_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/1_FP32/assets\n", - "[[0.00022801 0.00222478 0.00050746 ... 0.00011863 0.00026599 0.01312881]\n", - " [0.00022801 0.00222478 0.00050746 ... 0.00011863 0.00026599 0.01312881]\n", - " [0.00022801 0.00222478 0.00050746 ... 0.00011863 0.00026599 0.01312881]\n", - " ...\n", - " [0.00022801 0.00222478 0.00050746 ... 0.00011863 0.00026599 0.01312881]\n", - " [0.00022801 0.00222478 0.00050746 ... 0.00011863 0.00026599 0.01312881]\n", - " [0.00022801 0.00222478 0.00050746 ... 0.00011863 0.00026599 0.01312881]]\n", - "Finished!\n", - "\n", - "Starting inception_v3 tmp_savedmodels/2\n", - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_2_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/2_FP32/assets\n", - "[[0.00043102 0.00033233 0.0002535 ... 0.00012701 0.00023254 0.00082577]\n", - " [0.00043102 0.00033233 0.0002535 ... 0.00012701 0.00023254 0.00082577]\n", - " [0.00043102 0.00033233 0.0002535 ... 
0.00012701 0.00023254 0.00082577]\n", - " ...\n", - " [0.00043102 0.00033233 0.0002535 ... 0.00012701 0.00023254 0.00082577]\n", - " [0.00043102 0.00033233 0.0002535 ... 0.00012701 0.00023254 0.00082577]\n", - " [0.00043102 0.00033233 0.0002535 ... 0.00012701 0.00023254 0.00082577]]\n", - "Finished!\n", - "\n", - "Starting xception tmp_savedmodels/3\n", - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_3_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/3_FP32/assets\n", - "[[0.00022673 0.00034859 0.00021873 ... 0.00012943 0.00032854 0.00086526]\n", - " [0.00022673 0.00034859 0.00021873 ... 0.00012943 0.00032854 0.00086526]\n", - " [0.00022673 0.00034859 0.00021873 ... 0.00012943 0.00032854 0.00086526]\n", - " ...\n", - " [0.00022673 0.00034859 0.00021873 ... 0.00012943 0.00032854 0.00086526]\n", - " [0.00022673 0.00034859 0.00021873 ... 0.00012943 0.00032854 0.00086526]\n", - " [0.00022673 0.00034859 0.00021873 ... 0.00012943 0.00032854 0.00086526]]\n", - "Finished!\n", - "\n", - "Starting mobilenetv2_1.00_224 tmp_savedmodels/4\n", - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_4_0 in TF-TRT cache. 
This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/4_FP32/assets\n", - "[[1.8110585e-04 6.4528472e-04 6.8695762e-04 ... 7.9570833e-05\n", - " 1.3486181e-04 3.3463116e-03]\n", - " [1.8110585e-04 6.4528472e-04 6.8695762e-04 ... 7.9570833e-05\n", - " 1.3486181e-04 3.3463116e-03]\n", - " [1.8110585e-04 6.4528472e-04 6.8695762e-04 ... 7.9570833e-05\n", - " 1.3486181e-04 3.3463116e-03]\n", - " ...\n", - " [1.8110585e-04 6.4528472e-04 6.8695762e-04 ... 7.9570833e-05\n", - " 1.3486181e-04 3.3463116e-03]\n", - " [1.8110585e-04 6.4528472e-04 6.8695762e-04 ... 7.9570833e-05\n", - " 1.3486181e-04 3.3463116e-03]\n", - " [1.8110585e-04 6.4528472e-04 6.8695762e-04 ... 7.9570833e-05\n", - " 1.3486181e-04 3.3463116e-03]]\n", - "Finished!\n", - "\n", - "Starting densenet121 tmp_savedmodels/5\n", - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_5_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/5_FP32/assets\n", - "[[2.3581024e-04 3.7533988e-04 1.1308040e-04 ... 5.6219425e-05\n", - " 2.6299071e-04 1.1581751e-03]\n", - " [2.3581024e-04 3.7533988e-04 1.1308040e-04 ... 5.6219425e-05\n", - " 2.6299071e-04 1.1581751e-03]\n", - " [2.3581024e-04 3.7533988e-04 1.1308040e-04 ... 5.6219425e-05\n", - " 2.6299071e-04 1.1581751e-03]\n", - " ...\n", - " [2.3581024e-04 3.7533988e-04 1.1308040e-04 ... 5.6219425e-05\n", - " 2.6299071e-04 1.1581751e-03]\n", - " [2.3581024e-04 3.7533988e-04 1.1308040e-04 ... 
5.6219425e-05\n", - " 2.6299071e-04 1.1581751e-03]\n", - " [2.3581024e-04 3.7533988e-04 1.1308040e-04 ... 5.6219425e-05\n", - " 2.6299071e-04 1.1581751e-03]]\n", - "Finished!\n", - "\n", - "Starting resnet50v2 tmp_savedmodels/6\n", - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_6_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/6_FP32/assets\n", - "[[0.00082353 0.00079469 0.00060477 ... 0.00036948 0.00069747 0.00154858]\n", - " [0.00082353 0.00079469 0.00060477 ... 0.00036948 0.00069747 0.00154858]\n", - " [0.00082353 0.00079469 0.00060477 ... 0.00036948 0.00069747 0.00154858]\n", - " ...\n", - " [0.00082353 0.00079469 0.00060477 ... 0.00036948 0.00069747 0.00154858]\n", - " [0.00082353 0.00079469 0.00060477 ... 0.00036948 0.00069747 0.00154858]\n", - " [0.00082353 0.00079469 0.00060477 ... 0.00036948 0.00069747 0.00154858]]\n", - "Finished!\n", - "\n" - ] - } - ], - "source": [ - "from helper import ModelOptimizer\n", - "\n", - "opt_models = []\n", - "for model_class, model, dummy in zip(models, model_dirs, dummy_inputs):\n", - " print(\"Starting\", model_class._name, model)\n", - " model_opt = ModelOptimizer(model)\n", - " opt_trt = model_opt.convert(model+'_'+PRECISION, precision=PRECISION)\n", - "\n", - " print(opt_trt.predict(dummy))\n", - " \n", - " opt_models.append(opt_trt)\n", - " \n", - " print(\"Finished!\\n\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. What TensorRT runtime am I targeting?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will stay inside our Tensorflow/Python runtime:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[0.00082353, 0.00079469, 0.00060477, ..., 0.00036948, 0.00069747,\n", - " 0.00154858],\n", - " [0.00082353, 0.00079469, 0.00060477, ..., 0.00036948, 0.00069747,\n", - " 0.00154858],\n", - " [0.00082353, 0.00079469, 0.00060477, ..., 0.00036948, 0.00069747,\n", - " 0.00154858],\n", - " ...,\n", - " [0.00082353, 0.00079469, 0.00060477, ..., 0.00036948, 0.00069747,\n", - " 0.00154858],\n", - " [0.00082353, 0.00079469, 0.00060477, ..., 0.00036948, 0.00069747,\n", - " 0.00154858],\n", - " [0.00082353, 0.00079469, 0.00060477, ..., 0.00036948, 0.00069747,\n", - " 0.00154858]], dtype=float32)" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "opt_models[idx].predict(dummy_inputs[idx])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Performance Comparisons:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "idx = 0 #resnet" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 160 ms, sys: 5.52 ms, total: 166 ms\n", - "Wall time: 148 ms\n" - ] - }, - { - "data": { - "text/plain": [ - "array([[1.69642386e-04, 3.30075040e-04, 6.13506127e-05, ...,\n", - " 1.46224065e-05, 1.44499005e-04, 6.60870341e-04],\n", - " [1.69642386e-04, 3.30075040e-04, 6.13506127e-05, ...,\n", - " 1.46224065e-05, 1.44499005e-04, 6.60870341e-04],\n", - " [1.69642386e-04, 3.30075040e-04, 6.13506127e-05, ...,\n", - " 1.46224065e-05, 1.44499005e-04, 6.60870341e-04],\n", - " ...,\n", - " [1.69642386e-04, 3.30075040e-04, 6.13506127e-05, ...,\n", - " 1.46224065e-05, 
1.44499005e-04, 6.60870341e-04],\n", - " [1.69642386e-04, 3.30075040e-04, 6.13506127e-05, ...,\n", - " 1.46224065e-05, 1.44499005e-04, 6.60870341e-04],\n", - " [1.69642386e-04, 3.30075040e-04, 6.13506127e-05, ...,\n", - " 1.46224065e-05, 1.44499005e-04, 6.60870341e-04]], dtype=float32)" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "models[idx].predict(dummy_inputs[idx])" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 30.2 ms, sys: 8.3 ms, total: 38.5 ms\n", - "Wall time: 36.6 ms\n" - ] - }, - { - "data": { - "text/plain": [ - "array([[1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " ...,\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04],\n", - " [1.6964252e-04, 3.3007402e-04, 6.1350249e-05, ..., 1.4622317e-05,\n", - " 1.4449877e-04, 6.6086568e-04]], dtype=float32)" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "opt_models[idx].predict(dummy_inputs[idx])" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "idx = -3 # mobilenets" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 105 ms, sys: 14.4 ms, total: 120 ms\n", - "Wall 
time: 63.5 ms\n" - ] - }, - { - "data": { - "text/plain": [ - "array([[1.8110899e-04, 6.4530974e-04, 6.8695901e-04, ..., 7.9570033e-05,\n", - " 1.3486811e-04, 3.3462986e-03],\n", - " [1.8110899e-04, 6.4530974e-04, 6.8695901e-04, ..., 7.9570033e-05,\n", - " 1.3486811e-04, 3.3462986e-03],\n", - " [1.8110899e-04, 6.4530974e-04, 6.8695901e-04, ..., 7.9570033e-05,\n", - " 1.3486811e-04, 3.3462986e-03],\n", - " ...,\n", - " [1.8110899e-04, 6.4530974e-04, 6.8695901e-04, ..., 7.9570033e-05,\n", - " 1.3486811e-04, 3.3462986e-03],\n", - " [1.8110899e-04, 6.4530974e-04, 6.8695901e-04, ..., 7.9570033e-05,\n", - " 1.3486811e-04, 3.3462986e-03],\n", - " [1.8110899e-04, 6.4530974e-04, 6.8695901e-04, ..., 7.9570033e-05,\n", - " 1.3486811e-04, 3.3462986e-03]], dtype=float32)" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "models[idx].predict(dummy_inputs[idx])" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 19.9 ms, sys: 4.48 ms, total: 24.4 ms\n", - "Wall time: 22.4 ms\n" - ] - }, - { - "data": { - "text/plain": [ - "array([[1.8110585e-04, 6.4528472e-04, 6.8695762e-04, ..., 7.9570833e-05,\n", - " 1.3486181e-04, 3.3463116e-03],\n", - " [1.8110585e-04, 6.4528472e-04, 6.8695762e-04, ..., 7.9570833e-05,\n", - " 1.3486181e-04, 3.3463116e-03],\n", - " [1.8110585e-04, 6.4528472e-04, 6.8695762e-04, ..., 7.9570833e-05,\n", - " 1.3486181e-04, 3.3463116e-03],\n", - " ...,\n", - " [1.8110585e-04, 6.4528472e-04, 6.8695762e-04, ..., 7.9570833e-05,\n", - " 1.3486181e-04, 3.3463116e-03],\n", - " [1.8110585e-04, 6.4528472e-04, 6.8695762e-04, ..., 7.9570833e-05,\n", - " 1.3486181e-04, 3.3463116e-03],\n", - " [1.8110585e-04, 6.4528472e-04, 6.8695762e-04, ..., 7.9570833e-05,\n", - " 1.3486181e-04, 3.3463116e-03]], dtype=float32)" - ] - }, - "execution_count": 15, - "metadata": {}, - 
"output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "opt_models[idx].predict(dummy_inputs[idx])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - } - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/quickstart/IntroNotebooks/Additional Examples/2. TF-TRT Detection.ipynb b/quickstart/IntroNotebooks/Additional Examples/2. TF-TRT Detection.ipynb deleted file mode 100644 index ed694d7f..00000000 --- a/quickstart/IntroNotebooks/Additional Examples/2. TF-TRT Detection.ipynb +++ /dev/null @@ -1,585 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# TF-TRT Keras Retinanet Detection Example:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this notebook, we are going to optimize a Retinanet detection model from the official Keras examples! \n", - "\n", - "You can find the implementation here: https://keras.io/examples/vision/retinanet/\n", - "\n", - "In general, detection models can be tricky to optimize because they tend to require a lot of custom logic for sub-tasks such as region proposal, output decoding, or non-maximum suppression. This makes them a good demonstration of TF-TRT's capabilities - It does a great job of optimizing a large part of the network while leaving the custom logic untouched." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's make sure our GPUs are properly configured and visible:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Fri Jan 29 23:17:01 2021 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.1 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. |\n", - "|===============================+======================+======================|\n", - "| 0 Tesla V100-DGXS... On | 00000000:07:00.0 Off | 0 |\n", - "| N/A 42C P0 37W / 300W | 125MiB / 16155MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 Tesla V100-DGXS... On | 00000000:08:00.0 Off | 0 |\n", - "| N/A 43C P0 38W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 2 Tesla V100-DGXS... On | 00000000:0E:00.0 Off | 0 |\n", - "| N/A 42C P0 38W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 3 Tesla V100-DGXS... 
On | 00000000:0F:00.0 Off | 0 |\n", - "| N/A 43C P0 37W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "+-----------------------------------------------------------------------------+\n" - ] - } - ], - "source": [ - "!nvidia-smi" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will also need matplotlib to run the model. If you do not have it, run:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (3.3.4)\n", - "Requirement already satisfied: numpy>=1.15 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (1.17.3)\n", - "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (0.10.0)\n", - "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (1.3.1)\n", - "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (2.4.7)\n", - "Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (2.8.1)\n", - "Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (8.1.0)\n", - "Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from cycler>=0.10->matplotlib) (1.15.0)\n", - "\u001b[33mWARNING: You are using pip version 20.2.3; however, version 21.0 is 
available.\n", - "You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.\u001b[0m\n" - ] - } - ], - "source": [ - "!pip install matplotlib" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Remember, to successfully deploy a TensorRT model, you have to make __five key decisions__:\n", - "\n", - "1. __What format should I save my model in?__\n", - "2. __What batch size(s) am I running inference at?__\n", - "3. __What precision am I running inference at?__\n", - "4. __What TensorRT path am I using to convert my model?__\n", - "5. __What runtime am I targeting?__\n", - "\n", - "Let's give it a shot!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. What format should I save my model in?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will work with one of the Keras example RetinaNet implementations. We can download the specific version of the implementation code that we need here:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "--2021-01-29 23:17:05-- https://raw.githubusercontent.com/keras-team/keras-io/cd6201c1bfa37625f503f51e8fd3c572666770e4/examples/vision/retinanet.py\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.40.133\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.40.133|:443... connected.\n", - "HTTP request sent, awaiting response... 
200 OK\n", - "Length: 35046 (34K) [text/plain]\n", - "Saving to: ‘retinanet.py’\n", - "\n", - "retinanet.py 100%[===================>] 34.22K --.-KB/s in 0.002s \n", - "\n", - "2021-01-29 23:17:05 (20.1 MB/s) - ‘retinanet.py’ saved [35046/35046]\n", - "\n" - ] - } - ], - "source": [ - "!wget -O retinanet.py https://raw.githubusercontent.com/keras-team/keras-io/cd6201c1bfa37625f503f51e8fd3c572666770e4/examples/vision/retinanet.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The code has some unnecessary setup steps, so we will pull out just the model implementation itself using sed (you can check the end result in the [retinanet_model.py](./retinanet_model.py) file)" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "!sed -n '1,40 p; 71,820 p' retinanet.py > retinanet_model.py" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "!mkdir -p tmp_savedmodels" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We perform some imports and setup:" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "import tensorflow as tf\n", - "from tensorflow import keras\n", - "from tensorflow.keras import layers" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "img_size = (224, 224)\n", - "num_classes = 10" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can import our necessary RetinaNet functions from the example and initialize our detection model:" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5\n", - 
"94773248/94765736 [==============================] - 2s 0us/step\n" - ] - } - ], - "source": [ - "from retinanet_model import RetinaNet, DecodePredictions, get_backbone\n", - "\n", - "resnet50_backbone = get_backbone()\n", - "model = RetinaNet(num_classes, resnet50_backbone)\n", - "\n", - "image = tf.keras.Input(shape=[None, None, 3], name=\"image\")\n", - "predictions = model(image, training=False)\n", - "detections = DecodePredictions(confidence_threshold=0.5)(image, predictions)\n", - "inference_model = tf.keras.Model(inputs=image, outputs=detections)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we save our model in SavedModel format!" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "This property should not be used in TensorFlow 2.0, as updates are applied automatically.\n", - "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "This property should not be used in TensorFlow 2.0, as updates are applied automatically.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/detect_model/assets\n" - ] - } - ], - "source": [ - "model_dir = \"tmp_savedmodels/detect_model\"\n", - "model.save(model_dir) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. What batch size(s) am I running inference at?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will create a dummy batch of size 32:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "dummy_input = np.zeros((32, img_size[0], img_size[1], 3))" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "CombinedNonMaxSuppression(nmsed_boxes=array([[[0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " ...,\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.]],\n", - "\n", - " [[0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " ...,\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.]],\n", - "\n", - " [[0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " ...,\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.]],\n", - "\n", - " ...,\n", - "\n", - " [[0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " ...,\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.]],\n", - "\n", - " [[0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " ...,\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.]],\n", - "\n", - " [[0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " ...,\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.],\n", - " [0., 0., 0., 0.]]], dtype=float32), nmsed_scores=array([[0., 0., 0., ..., 0., 0., 0.],\n", - " [0., 0., 0., ..., 0., 0., 0.],\n", - " [0., 0., 0., ..., 0., 0., 0.],\n", - " ...,\n", - " [0., 0., 0., ..., 0., 0., 0.],\n", - " [0., 0., 0., ..., 0., 0., 0.],\n", - " [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), nmsed_classes=array([[0., 0., 0., ..., 0., 0., 0.],\n", - " [0., 0., 0., ..., 0., 0., 0.],\n", - " [0., 0., 0., ..., 0., 0., 
0.],\n", - " ...,\n", - " [0., 0., 0., ..., 0., 0., 0.],\n", - " [0., 0., 0., ..., 0., 0., 0.],\n", - " [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), valid_detections=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", - " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32))" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "inference_model.predict(dummy_input)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. What precision am I running inference at?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will stick with the same FP32 precision used during training:" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "PRECISION = \"FP32\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. What TensorRT path am I using to convert my model?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will use our example TF-TRT based ModelOptimizer wrapper:" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "from helper import ModelOptimizer\n", - "\n", - "model_opt = ModelOptimizer(model_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Convert to our target precision, saving the result in a new SavedModel:" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_2 in TF-TRT cache. 
This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_1 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_3 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/detect_model_FP32/assets\n", - "conversion complete! prediction shape: (32, 9441, 14)\n" - ] - } - ], - "source": [ - "opt_trt = model_opt.convert(model_dir+'_'+PRECISION, precision=PRECISION)\n", - "print(\"conversion complete! prediction shape:\", opt_trt.predict(dummy_input).shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. What TensorRT runtime am I targeting?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will stick to our TF-TRT/Tensorflow runtime:" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Warming up...\n", - "(32, 9441, 14)\n", - "(32, 9441, 14)\n", - "Done warming up!\n" - ] - } - ], - "source": [ - "print(\"Warming up...\")\n", - "\n", - "print(model.predict(dummy_input).shape)\n", - "print(opt_trt.predict(dummy_input).shape)\n", - "\n", - "print(\"Done warming up!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Performance Comparisons:" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "109 ms ± 5.53 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "preds = model.predict(dummy_input)" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "45.1 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "preds = opt_trt.predict(dummy_input)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - } - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/quickstart/IntroNotebooks/Additional Examples/3. TF-TRT Segmentation.ipynb b/quickstart/IntroNotebooks/Additional Examples/3. TF-TRT Segmentation.ipynb deleted file mode 100644 index 5c091c79..00000000 --- a/quickstart/IntroNotebooks/Additional Examples/3. TF-TRT Segmentation.ipynb +++ /dev/null @@ -1,480 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# TF-TRT Keras UNet Segmentation Example" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this notebook, we are going to optimize a UNet-style segmentation model from the official Keras examples! \n", - "\n", - "You can find the implementation here: https://keras.io/examples/vision/oxford_pets_image_segmentation/\n", - "\n", - "Segmentation is a great demonstration for TensorRT as it tends to be very heavy on convolutional layers, which accelerate well. 
This is especially true of UNet, which consists entirely of convolutional layers." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's make sure our GPUs are properly configured and visible:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Fri Jan 29 23:22:33 2021 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.1 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. |\n", - "|===============================+======================+======================|\n", - "| 0 Tesla V100-DGXS... On | 00000000:07:00.0 Off | 0 |\n", - "| N/A 43C P0 62W / 300W | 125MiB / 16155MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 Tesla V100-DGXS... On | 00000000:08:00.0 Off | 0 |\n", - "| N/A 43C P0 41W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 2 Tesla V100-DGXS... On | 00000000:0E:00.0 Off | 0 |\n", - "| N/A 42C P0 40W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 3 Tesla V100-DGXS... 
On | 00000000:0F:00.0 Off | 0 |\n", - "| N/A 43C P0 39W / 300W | 6MiB / 16158MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "+-----------------------------------------------------------------------------+\n" - ] - } - ], - "source": [ - "!nvidia-smi" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Remember, to successfully deploy a TensorRT model, you have to make __five key decisions__:\n", - "\n", - "1. __What format should I save my model in?__\n", - "2. __What batch size(s) am I running inference at?__\n", - "3. __What precision am I running inference at?__\n", - "4. __What TensorRT tool or integration am I using to convert my model?__\n", - "5. __What TensorRT runtime am I targeting?__\n", - "\n", - "Let's try converting our segmentation model!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. What format should I save my model in?"
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, some setup:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "!mkdir -p tmp_savedmodels" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "import tensorflow\n", - "from tensorflow import keras\n", - "from tensorflow.keras import layers" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we will download a specific implementation of U-Net from the Keras examples:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "--2021-01-29 23:22:36-- https://raw.githubusercontent.com/keras-team/keras-io/cd6201c1bfa37625f503f51e8fd3c572666770e4/examples/vision/oxford_pets_image_segmentation.py\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.40.133\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.40.133|:443... connected.\n", - "HTTP request sent, awaiting response... 
200 OK\n", - "Length: 7422 (7.2K) [text/plain]\n", - "Saving to: ‘oxford_pets_image_segmentation.py’\n", - "\n", - "oxford_pets_image_s 100%[===================>] 7.25K 29.9KB/s in 0.2s \n", - "\n", - "2021-01-29 23:22:37 (29.9 KB/s) - ‘oxford_pets_image_segmentation.py’ saved [7422/7422]\n", - "\n" - ] - } - ], - "source": [ - "!wget -O oxford_pets_image_segmentation.py https://raw.githubusercontent.com/keras-team/keras-io/cd6201c1bfa37625f503f51e8fd3c572666770e4/examples/vision/oxford_pets_image_segmentation.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The example has some unnecessary setup lines, we can pull out just the model itself using sed:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "!sed -n '70,74 p; 108,170 p' oxford_pets_image_segmentation.py > unet_model.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Last, we import and save our model:" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "img_size = (224, 224)\n", - "num_classes = 10" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "This property should not be used in TensorFlow 2.0, as updates are applied automatically.\n", - "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\n", - "Instructions for updating:\n", - "This property should not be used in 
TensorFlow 2.0, as updates are applied automatically.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/segment_model/assets\n" - ] - } - ], - "source": [ - "from unet_model import get_model\n", - "\n", - "model = get_model(img_size, num_classes)\n", - "\n", - "model_dir = \"tmp_savedmodels/segment_model\"\n", - "model.save(model_dir) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. What batch size(s) am I running inference at?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will create a dummy batch of size 32:" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "dummy_input = np.zeros((32, img_size[0], img_size[1], 3))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can test our original model, before any optimization:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(32, 224, 224, 10)" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "prediction = model.predict(dummy_input)\n", - "\n", - "prediction.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. What precision am I running inference at?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will stick with the same FP32 precision used during training:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "PRECISION = \"FP32\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. What TensorRT tool or integration am I using to convert my model?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will use our example TF-TRT based ModelOptimizer wrapper:" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "from helper import ModelOptimizer\n", - "\n", - "model_opt = ModelOptimizer(model_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Convert to our target precision, saving the result in a new SavedModel:" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Linked TensorRT version: (7, 2, 1)\n", - "INFO:tensorflow:Loaded TensorRT version: (7, 2, 2)\n", - "INFO:tensorflow:Loaded TensorRT 7.2.2 and linked TensorFlow against TensorRT 7.2.1. This is supported because TensorRT minor/patch upgrades are backward compatible\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_1 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_10 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_11 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_2 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_12 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_5 in TF-TRT cache. 
This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_3 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_6 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_7 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_4 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_8 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_9 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Could not find TRTEngineOp_0_0 in TF-TRT cache. This can happen if build() is not called, which means TensorRT engines will be built and cached at runtime.\n", - "INFO:tensorflow:Assets written to: tmp_savedmodels/segment_model_FP32/assets\n", - "conversion complete! prediction shape: (32, 224, 224, 10)\n" - ] - } - ], - "source": [ - "opt_trt = model_opt.convert(model_dir+'_FP32', precision=PRECISION)\n", - "print(\"conversion complete! prediction shape:\", opt_trt.predict(dummy_input).shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. What TensorRT runtime am I targeting?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will stick to our TF-TRT/Tensorflow runtime:" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Warming up...\n", - "(32, 224, 224, 10)\n", - "(32, 224, 224, 10)\n", - "Done warming up!\n" - ] - } - ], - "source": [ - "print(\"Warming up...\")\n", - "\n", - "print(model.predict(dummy_input).shape)\n", - "print(opt_trt.predict(dummy_input).shape)\n", - "\n", - "print(\"Done warming up!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Performance Comparisons:" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "182 ms ± 41.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "preds = model.predict(dummy_input)" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "22.3 ms ± 95.9 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "\n", - "preds = opt_trt.predict(dummy_input)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "source_hidden": true - } - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/quickstart/IntroNotebooks/Additional Examples/helper.py b/quickstart/IntroNotebooks/Additional Examples/helper.py deleted file mode 100644 index c00ed985..00000000 --- a/quickstart/IntroNotebooks/Additional Examples/helper.py +++ /dev/null @@ -1,111 +0,0 @@ -# -# SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# - -from tensorflow.python.compiler.tensorrt import trt_convert as tf_trt -from tensorflow.python.saved_model import tag_constants -import tensorflow as tf -import tensorrt as trt - -import numpy as np - -precision_dict = { - "FP32": tf_trt.TrtPrecisionMode.FP32, - "FP16": tf_trt.TrtPrecisionMode.FP16, - "INT8": tf_trt.TrtPrecisionMode.INT8, -} - -# For TF-TRT: - -class OptimizedModel(): - def __init__(self, saved_model_dir = None): - self.loaded_model_fn = None - - if not saved_model_dir is None: - self.load_model(saved_model_dir) - - - def predict(self, input_data): - if self.loaded_model_fn is None: - raise(Exception("Haven't loaded a model")) - x = tf.constant(input_data.astype('float32')) - labeling = self.loaded_model_fn(x) - try: - preds = labeling['predictions'].numpy() - except: - try: - preds = labeling['probs'].numpy() - except: - try: - preds = labeling[next(iter(labeling.keys()))] - except: - raise(Exception("Failed to get predictions from saved model object")) - return preds - - def load_model(self, saved_model_dir): - saved_model_loaded = tf.saved_model.load(saved_model_dir, tags=[tag_constants.SERVING]) - wrapper_fp32 = saved_model_loaded.signatures['serving_default'] - - self.loaded_model_fn = wrapper_fp32 - -class ModelOptimizer(): - def __init__(self, input_saved_model_dir, calibration_data=None): - self.input_saved_model_dir = input_saved_model_dir - self.calibration_data = None - self.loaded_model = None - - if not calibration_data is None: - self.set_calibration_data(calibration_data) - - - def set_calibration_data(self, calibration_data): - - def calibration_input_fn(): - yield (tf.constant(calibration_data.astype('float32')), ) - - self.calibration_data = calibration_input_fn - - - def convert(self, output_saved_model_dir, precision="FP32", max_workspace_size_bytes=8000000000, **kwargs): - - if precision == "INT8" and self.calibration_data is None: - raise(Exception("No calibration data set!")) - - trt_precision = precision_dict[precision] 
- conversion_params = tf_trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode=trt_precision, - max_workspace_size_bytes=max_workspace_size_bytes, - use_calibration= precision == "INT8") - converter = tf_trt.TrtGraphConverterV2(input_saved_model_dir=self.input_saved_model_dir, - conversion_params=conversion_params) - - if precision == "INT8": - converter.convert(calibration_input_fn=self.calibration_data) - else: - converter.convert() - - converter.save(output_saved_model_dir=output_saved_model_dir) - - return OptimizedModel(output_saved_model_dir) - - def predict(self, input_data): - if self.loaded_model is None: - self.load_default_model() - - return self.loaded_model.predict(input_data) - - def load_default_model(self): - self.loaded_model = tf.keras.models.load_model('resnet50_saved_model') - diff --git a/quickstart/IntroNotebooks/images/tf_onnx.png b/quickstart/IntroNotebooks/images/tf_onnx.png deleted file mode 100644 index f08b3c70..00000000 Binary files a/quickstart/IntroNotebooks/images/tf_onnx.png and /dev/null differ diff --git a/quickstart/IntroNotebooks/images/tf_trt.png b/quickstart/IntroNotebooks/images/tf_trt.png deleted file mode 100644 index e2800821..00000000 Binary files a/quickstart/IntroNotebooks/images/tf_trt.png and /dev/null differ diff --git a/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb b/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb deleted file mode 100644 index cadd4de8..00000000 --- a/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb +++ /dev/null @@ -1,1732 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "b861c182", - "metadata": {}, - "outputs": [], - "source": [ - "# Copyright 2022 NVIDIA Corporation. 
All Rights Reserved.\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# http://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License.\n", - "# ==============================================================================" - ] - }, - { - "cell_type": "markdown", - "id": "c6384192", - "metadata": {}, - "source": [ - "\n", - "\n", - "# Accelerate Deep Learning Models using TensorRT " - ] - }, - { - "attachments": { - "img1.JPG": { - "image/jpeg": "" - } - }, - "cell_type": "markdown", - "id": "f5454823", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "Deep learning has touched almost every industry and has transformed the way industries operate and provide services. Real-time inference is all around us: for example, the advertisement you see while swiping through stories on Instagram, or the video recommendation that appears on your YouTube home screen. To serve these real-time inferences, deep learning practitioners need to maximize model throughput while keeping predictions highly accurate. Among many techniques, quantization can be used to accelerate models.\n", - "\n", - "Model quantization is a popular optimization technique that reduces the size of models and thereby accelerates inference, while also making deployment possible on devices with lower computation power, such as Jetson.
Simply put, quantization is a process of mapping input values from a larger set to output values in a smaller set. In the context of deep learning, we often train models using 32-bit floating-point arithmetic (FP32), as we can take advantage of a wider range of numbers, resulting in more accurate models. The model data (network parameters and activations) are converted from this floating-point representation to a lower-precision representation, typically 8-bit integers (int8). In the case of int8, the range [qmin, qmax] would be [-128, 127].\n", - "\n", - "![img1.JPG](attachment:img1.JPG)\n", - "\n", - "A quick rationale for how quantization achieves higher throughput can be shown through the following thought experiment: imagine the complexity of multiplying 3.999x2.999 versus 4x3. The latter is easier to perform than the former. This is the simplicity in calculation gained by quantizing the numbers to lower precision. However, the challenge here is that rounding errors can result in a less accurate model. To address this loss of accuracy, different quantization techniques have been developed. These techniques can be classified into two categories: post-training quantization (PTQ) and quantization-aware training (QAT).\n", - "\n", - "In this notebook, we illustrate the workflow that you can adopt in order to quantize a deep learning model using TensorRT. The notebook takes you through an example of Mobilenetv2 for a classification task on a subset of the ImageNet dataset called Imagenette, which has 10 classes. \n", - "\n", - "1. [Requirements](#1)\n", - "2. [Set up a baseline Mobilenetv2 model](#2)\n", - "3. [Convert to TensorRT](#3)\n", - "4. [Post Training Quantization (PTQ)](#4)\n", - "5. [Quantization Aware Training (QAT)](#5)\n", - "6. [Evaluation and Benchmarking](#6)\n", - "7. [Conclusion](#7)\n", - "8. [References](#8)\n", - "\n", - "This notebook is implemented using the NGC PyTorch container nvcr.io/nvidia/pytorch:22.04-py3.
Follow the instructions at https://ngc.nvidia.com/setup/api-key to set up your own API key to use the NGC service through the Docker client. " - ] - }, - { - "cell_type": "markdown", - "id": "06b37d07", - "metadata": {}, - "source": [ - "\n", - "## 1. Requirements\n", - "Please install the required dependencies and import the libraries listed below." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "0a068b12", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install ipywidgets --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host=files.pythonhosted.org\n", - "!pip install wget\n", - "!pip install pycuda" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "4e2e58b2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2.1.2\n" - ] - } - ], - "source": [ - "import torch\n", - "import torch.nn as nn\n", - "import torch.optim as optim\n", - "import torch.utils.data as data\n", - "import torchvision.transforms as transforms\n", - "from torchvision import models, datasets\n", - "\n", - "import pytorch_quantization\n", - "from pytorch_quantization import nn as quant_nn\n", - "from pytorch_quantization import quant_modules\n", - "from pytorch_quantization import calib\n", - "from tqdm import tqdm\n", - "\n", - "print(pytorch_quantization.__version__)\n", - "\n", - "import os\n", - "import tensorrt as trt\n", - "import numpy as np\n", - "import time\n", - "import wget\n", - "import tarfile\n", - "import shutil" - ] - }, - { - "cell_type": "markdown", - "id": "0575e590", - "metadata": {}, - "source": [ - "\n", - "## 2. Set up a baseline Mobilenetv2 Model" - ] - }, - { - "cell_type": "markdown", - "id": "a83b886f", - "metadata": {}, - "source": [ - "#### Preparing the Dataset\n", - "\n", - "Imagenette is a subset of ImageNet and has 10 classes.
The classes are as follows in the order of their labels : 'tench', 'English springer', 'cassette player', 'chain saw', 'church', 'French horn', 'garbage truck', 'gas pump', 'golf ball' and 'parachute'. " - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "50d60fbe", - "metadata": {}, - "outputs": [], - "source": [ - "def download_data(DATA_DIR):\n", - " if os.path.exists(DATA_DIR):\n", - " if not os.path.exists(os.path.join(DATA_DIR, 'imagenette2-320')):\n", - " url = 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-320.tgz'\n", - " wget.download(url)\n", - " # open file\n", - " file = tarfile.open('imagenette2-320.tgz')\n", - " # extracting file\n", - " file.extractall(DATA_DIR)\n", - " file.close()\n", - " else:\n", - " print(\"This directory doesn't exist. Create the directory and run again\")" - ] - }, - { - "cell_type": "markdown", - "id": "2e25dc45", - "metadata": {}, - "source": [ - "Let's create the data directory if it doesn't exist." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "4a4d8949", - "metadata": {}, - "outputs": [], - "source": [ - "if not os.path.exists(\"./data\"):\n", - " os.mkdir(\"./data\")\n", - "download_data(\"./data\")" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "07d1fc63", - "metadata": {}, - "outputs": [], - "source": [ - "# Define main data directory\n", - "DATA_DIR = './data/imagenette2-320' \n", - "# Define training and validation data paths\n", - "TRAIN_DIR = os.path.join(DATA_DIR, 'train') \n", - "VAL_DIR = os.path.join(DATA_DIR, 'val')" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "acd3cd99", - "metadata": {}, - "outputs": [], - "source": [ - "# Performing Transformations on the dataset and defining training and validation dataloaders\n", - "transform = transforms.Compose([\n", - " transforms.Resize(256),\n", - " transforms.CenterCrop(224),\n", - " transforms.ToTensor(),\n", - " ])\n", - "train_dataset = datasets.ImageFolder(TRAIN_DIR, 
transform=transform)\n", - "val_dataset = datasets.ImageFolder(VAL_DIR, transform=transform)\n", - "calib_dataset = torch.utils.data.random_split(val_dataset, [2901, 1024])[1]\n", - "\n", - "train_dataloader = data.DataLoader(train_dataset, batch_size=64, shuffle=True, drop_last=True)\n", - "val_dataloader = data.DataLoader(val_dataset, batch_size=64, shuffle=False, drop_last=True)\n", - "calib_dataloader = data.DataLoader(calib_dataset, batch_size=64, shuffle=False, drop_last=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "a2f8914c", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "tensor(0)\n" - ] - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "# Visualising an image from the validation set\n", - "import matplotlib.pyplot as plt\n", - "for images, labels in val_dataloader:\n", - " print(labels[0])\n", - " image = images[0]\n", - " img = image.swapaxes(0, 1)\n", - " img = img.swapaxes(1, 2)\n", - " plt.imshow(img)\n", - " break" - ] - }, - { - "cell_type": "markdown", - "id": "4b7441e6", - "metadata": {}, - "source": [ - "#### Setting up Mobilenetv2\n", - "\n", - "Mobilenetv2 available in Torchvision is pretrained on the ImageNet that has 1000 classes. The Imagenette dataset has 10 classes. \n", - "We set up this model by freezing the weights excpet for the last classification layer and train only the last classification layer to be able to predict the 10 classes of the dataset. " - ] - }, - { - "cell_type": "markdown", - "id": "b9577f2a", - "metadata": {}, - "source": [ - "*Define the Mobilenetv2 model*" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "c29ae7b8", - "metadata": {}, - "outputs": [], - "source": [ - "# This function allows you to set the all the parameters to not have gradients, \n", - "# allowing you to freeze the model and not undergo training during the train step. \n", - "def set_parameter_requires_grad(model, feature_extracting):\n", - " if feature_extracting:\n", - " for param in model.parameters():\n", - " param.requires_grad = False\n", - " \n", - "feature_extract = True #This varaible can be set False if you want to finetune the model by updating all the parameters. 
\n", - "model = models.mobilenet_v2(pretrained=True)\n", - "set_parameter_requires_grad(model, feature_extract)\n", - "#Define a classification head for 10 classes.\n", - "model.classifier[1] = nn.Linear(1280, 10)\n", - "model = model.cuda()" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "5c03df98", - "metadata": {}, - "outputs": [], - "source": [ - "# Declare Learning rate\n", - "lr = 0.0001\n", - "\n", - "# Use cross entropy loss for classification and SGD optimizer\n", - "criterion = nn.CrossEntropyLoss()\n", - "optimizer = optim.SGD(model.parameters(), lr=lr)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7095a995", - "metadata": {}, - "outputs": [], - "source": [ - "# Define functions for training, evaluation, saving checkpoints, and benchmarking\n", - "def train(model, dataloader, crit, opt, epoch):\n", - " model.train()\n", - " running_loss = 0.0\n", - " for batch, (data, labels) in enumerate(dataloader):\n", - " data, labels = data.cuda(), labels.cuda(non_blocking=True)\n", - " opt.zero_grad()\n", - " out = model(data)\n", - " loss = crit(out, labels)\n", - " loss.backward()\n", - " opt.step()\n", - " running_loss += loss.item()\n", - " if batch % 100 == 99:\n", - " print(\"Batch: [%5d | %5d] loss: %.3f\" % (batch + 1, len(dataloader), running_loss / 100))\n", - " running_loss = 0.0\n", - " \n", - "def evaluate(model, dataloader, crit, epoch):\n", - " total = 0\n", - " correct = 0\n", - " loss = 0.0\n", - " class_probs = []\n", - " class_preds = []\n", - " model.eval()\n", - " with torch.no_grad():\n", - " for data, labels in dataloader:\n", - " data, labels = data.cuda(), labels.cuda(non_blocking=True)\n", - " out = model(data)\n", - " loss += crit(out, labels)\n", - " preds = torch.max(out, 1)[1]\n", - " class_preds.append(preds)\n", - " total += labels.size(0)\n", - " correct += (preds == labels).sum().item()\n", - " return correct / total\n", - "\n", - "def save_checkpoint(state, 
ckpt_path=\"checkpoint.pth\"):\n", - " torch.save(state, ckpt_path)\n", - " print(\"Checkpoint saved\")\n", - " \n", - "# Helper function to benchmark the model\n", - "torch.backends.cudnn.benchmark = True\n", - "def benchmark(model, input_shape=(1024, 1, 32, 32), dtype='fp32', nwarmup=50, nruns=1000):\n", - " input_data = torch.randn(input_shape)\n", - " input_data = input_data.to(\"cuda\")\n", - " if dtype=='fp16':\n", - " input_data = input_data.half()\n", - " \n", - " with torch.no_grad():\n", - " for _ in range(nwarmup):\n", - " features = model(input_data)\n", - " torch.cuda.synchronize()\n", - " \n", - " timings = []\n", - " with torch.no_grad():\n", - " for i in range(1, nruns+1):\n", - " start_time = time.time()\n", - " output = model(input_data)\n", - " torch.cuda.synchronize()\n", - " end_time = time.time()\n", - " timings.append(end_time - start_time)\n", - "\n", - " print('Average batch time: %.2f ms'%(np.mean(timings)*1000))" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "02a625c9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Epoch: [ 1 / 5] LR: 0.000100\n", - "Batch: [ 100 | 147] loss: 2.315\n", - "Test Acc: 22.93%\n", - "Epoch: [ 2 / 5] LR: 0.000100\n", - "Batch: [ 100 | 147] loss: 2.177\n", - "Test Acc: 35.09%\n", - "Epoch: [ 3 / 5] LR: 0.000100\n", - "Batch: [ 100 | 147] loss: 2.053\n", - "Test Acc: 49.33%\n", - "Epoch: [ 4 / 5] LR: 0.000100\n", - "Batch: [ 100 | 147] loss: 1.935\n", - "Test Acc: 61.50%\n", - "Epoch: [ 5 / 5] LR: 0.000100\n", - "Batch: [ 100 | 147] loss: 1.836\n", - "Test Acc: 71.11%\n", - "Checkpoint saved\n" - ] - } - ], - "source": [ - "# Train the model for 5 epochs to attain an acceptable accuracy.\n", - "num_epochs=5\n", - "for epoch in range(num_epochs):\n", - " print('Epoch: [%5d / %5d] LR: %f' % (epoch + 1, num_epochs, lr))\n", - "\n", - " train(model, train_dataloader, criterion, optimizer, epoch)\n", - " test_acc = evaluate(model, val_dataloader, criterion, 
epoch)\n", - "\n", - " print(\"Test Acc: {:.2f}%\".format(100 * test_acc))\n", - " \n", - "save_checkpoint({'epoch': epoch + 1,\n", - " 'model_state_dict': model.state_dict(),\n", - " 'acc': test_acc,\n", - " 'opt_state_dict': optimizer.state_dict()\n", - " },\n", - " ckpt_path=\"models/mobilenetv2_base_ckpt\")" - ] - }, - { - "cell_type": "markdown", - "id": "b829681d", - "metadata": {}, - "source": [ - "We will first generate and evaluate our models, and then look at their performance towards the end of the notebook." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "411d0ebc", - "metadata": { - "scrolled": true - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Mobilenetv2 Baseline accuracy: 71.11%\n" - ] - } - ], - "source": [ - "# Evaluate the baseline model\n", - "test_acc = evaluate(model, val_dataloader, criterion, 0)\n", - "print(\"Mobilenetv2 Baseline accuracy: {:.2f}%\".format(100 * test_acc))" - ] - }, - { - "cell_type": "markdown", - "id": "71fdd581", - "metadata": {}, - "source": [ - "\n", - "## 3. Convert to TensorRT\n", - "\n", - "TensorRT is an SDK facilitating high-performance deep learning inference, optimized to run on NVIDIA GPUs. It accelerates models through graph optimization and quantization. This notebook uses the trtexec CLI tool to build the TensorRT engine. " - ] - }, - { - "cell_type": "markdown", - "id": "f75ab9fd", - "metadata": {}, - "source": [ - "Let us convert the above FP32 Mobilenetv2 into a TensorRT engine. Before we do that, we need to first export our model into ONNX format. ONNX is a standard for representing deep learning models, enabling them to be transferred between frameworks. The average run time of the TensorRT engine is reported as the mean 'GPU Compute Time' in the trtexec logs." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "e24451cf", - "metadata": { - "scrolled": true - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "&&&& RUNNING TensorRT.trtexec [TensorRT v8205] # trtexec --onnx=models/mobilenetv2_base.onnx --saveEngine=models/mobilenetv2_base.trt\n", - "[07/25/2022-16:42:22] [I] === Model Options ===\n", - "[07/25/2022-16:42:22] [I] Format: ONNX\n", - "[07/25/2022-16:42:22] [I] Model: models/mobilenetv2_base.onnx\n", - "[07/25/2022-16:42:22] [I] Output:\n", - "[07/25/2022-16:42:22] [I] === Build Options ===\n", - "[07/25/2022-16:42:22] [I] Max batch: explicit batch\n", - "[07/25/2022-16:42:22] [I] Workspace: 16 MiB\n", - "[07/25/2022-16:42:22] [I] minTiming: 1\n", - "[07/25/2022-16:42:22] [I] avgTiming: 8\n", - "[07/25/2022-16:42:22] [I] Precision: FP32\n", - "[07/25/2022-16:42:22] [I] Calibration: \n", - "[07/25/2022-16:42:22] [I] Refit: Disabled\n", - "[07/25/2022-16:42:22] [I] Sparsity: Disabled\n", - "[07/25/2022-16:42:22] [I] Safe mode: Disabled\n", - "[07/25/2022-16:42:22] [I] DirectIO mode: Disabled\n", - "[07/25/2022-16:42:22] [I] Restricted mode: Disabled\n", - "[07/25/2022-16:42:22] [I] Save engine: models/mobilenetv2_base.trt\n", - "[07/25/2022-16:42:22] [I] Load engine: \n", - "[07/25/2022-16:42:22] [I] Profiling verbosity: 0\n", - "[07/25/2022-16:42:22] [I] Tactic sources: Using default tactic sources\n", - "[07/25/2022-16:42:22] [I] timingCacheMode: local\n", - "[07/25/2022-16:42:22] [I] timingCacheFile: \n", - "[07/25/2022-16:42:22] [I] Input(s)s format: fp32:CHW\n", - "[07/25/2022-16:42:22] [I] Output(s)s format: fp32:CHW\n", - "[07/25/2022-16:42:22] [I] Input build shapes: model\n", - "[07/25/2022-16:42:22] [I] Input calibration shapes: model\n", - "[07/25/2022-16:42:22] [I] === System Options ===\n", - "[07/25/2022-16:42:22] [I] Device: 0\n", - "[07/25/2022-16:42:22] [I] DLACore: \n", - "[07/25/2022-16:42:22] [I] Plugins:\n", - "[07/25/2022-16:42:22] [I] === 
Inference Options ===\n", - "[07/25/2022-16:42:22] [I] Batch: Explicit\n", - "[07/25/2022-16:42:22] [I] Input inference shapes: model\n", - "[07/25/2022-16:42:22] [I] Iterations: 10\n", - "[07/25/2022-16:42:22] [I] Duration: 3s (+ 200ms warm up)\n", - "[07/25/2022-16:42:22] [I] Sleep time: 0ms\n", - "[07/25/2022-16:42:22] [I] Idle time: 0ms\n", - "[07/25/2022-16:42:22] [I] Streams: 1\n", - "[07/25/2022-16:42:22] [I] ExposeDMA: Disabled\n", - "[07/25/2022-16:42:22] [I] Data transfers: Enabled\n", - "[07/25/2022-16:42:22] [I] Spin-wait: Disabled\n", - "[07/25/2022-16:42:22] [I] Multithreading: Disabled\n", - "[07/25/2022-16:42:22] [I] CUDA Graph: Disabled\n", - "[07/25/2022-16:42:22] [I] Separate profiling: Disabled\n", - "[07/25/2022-16:42:22] [I] Time Deserialize: Disabled\n", - "[07/25/2022-16:42:22] [I] Time Refit: Disabled\n", - "[07/25/2022-16:42:22] [I] Skip inference: Disabled\n", - "[07/25/2022-16:42:22] [I] Inputs:\n", - "[07/25/2022-16:42:22] [I] === Reporting Options ===\n", - "[07/25/2022-16:42:22] [I] Verbose: Disabled\n", - "[07/25/2022-16:42:22] [I] Averages: 10 inferences\n", - "[07/25/2022-16:42:22] [I] Percentile: 99\n", - "[07/25/2022-16:42:22] [I] Dump refittable layers:Disabled\n", - "[07/25/2022-16:42:22] [I] Dump output: Disabled\n", - "[07/25/2022-16:42:22] [I] Profile: Disabled\n", - "[07/25/2022-16:42:22] [I] Export timing to JSON file: \n", - "[07/25/2022-16:42:22] [I] Export output to JSON file: \n", - "[07/25/2022-16:42:22] [I] Export profile to JSON file: \n", - "[07/25/2022-16:42:22] [I] \n", - "[07/25/2022-16:42:22] [I] === Device Information ===\n", - "[07/25/2022-16:42:22] [I] Selected Device: NVIDIA Graphics Device\n", - "[07/25/2022-16:42:22] [I] Compute Capability: 8.0\n", - "[07/25/2022-16:42:22] [I] SMs: 124\n", - "[07/25/2022-16:42:22] [I] Compute Clock Rate: 1.005 GHz\n", - "[07/25/2022-16:42:22] [I] Device Global Memory: 47681 MiB\n", - "[07/25/2022-16:42:22] [I] Shared Memory per SM: 164 KiB\n", - "[07/25/2022-16:42:22] [I] 
Memory Bus Width: 6144 bits (ECC enabled)\n", - "[07/25/2022-16:42:22] [I] Memory Clock Rate: 1.215 GHz\n", - "[07/25/2022-16:42:22] [I] \n", - "[07/25/2022-16:42:22] [I] TensorRT version: 8.2.5\n", - "[07/25/2022-16:42:23] [I] [TRT] [MemUsageChange] Init CUDA: CPU +440, GPU +0, now: CPU 452, GPU 5848 (MiB)\n", - "[07/25/2022-16:42:23] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 452 MiB, GPU 5848 MiB\n", - "[07/25/2022-16:42:23] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 669 MiB, GPU 5920 MiB\n", - "[07/25/2022-16:42:23] [I] Start parsing network model\n", - "[07/25/2022-16:42:23] [I] [TRT] ----------------------------------------------------------------\n", - "[07/25/2022-16:42:23] [I] [TRT] Input filename: models/mobilenetv2_base.onnx\n", - "[07/25/2022-16:42:23] [I] [TRT] ONNX IR version: 0.0.7\n", - "[07/25/2022-16:42:23] [I] [TRT] Opset version: 13\n", - "[07/25/2022-16:42:23] [I] [TRT] Producer name: pytorch\n", - "[07/25/2022-16:42:23] [I] [TRT] Producer version: 1.13.0\n", - "[07/25/2022-16:42:23] [I] [TRT] Domain: \n", - "[07/25/2022-16:42:23] [I] [TRT] Model version: 0\n", - "[07/25/2022-16:42:23] [I] [TRT] Doc string: \n", - "[07/25/2022-16:42:23] [I] [TRT] ----------------------------------------------------------------\n", - "[07/25/2022-16:42:23] [I] Finish parsing network model\n", - "[07/25/2022-16:42:24] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +839, GPU +362, now: CPU 1532, GPU 6290 (MiB)\n", - "[07/25/2022-16:42:24] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +128, GPU +58, now: CPU 1660, GPU 6348 (MiB)\n", - "[07/25/2022-16:42:24] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.\n", - "[07/25/2022-16:42:28] [I] [TRT] Some tactics do not have sufficient workspace memory to run. 
Increasing workspace size may increase performance, please check verbose output.\n", - "[07/25/2022-16:43:21] [I] [TRT] Detected 1 inputs and 1 output network tensors.\n", - "[07/25/2022-16:43:21] [I] [TRT] Total Host Persistent Memory: 82528\n", - "[07/25/2022-16:43:21] [I] [TRT] Total Device Persistent Memory: 8861184\n", - "[07/25/2022-16:43:21] [I] [TRT] Total Scratch Memory: 4194304\n", - "[07/25/2022-16:43:21] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 8 MiB, GPU 624 MiB\n", - "[07/25/2022-16:43:21] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 1.67234ms to assign 4 blocks to 59 nodes requiring 449576960 bytes.\n", - "[07/25/2022-16:43:21] [I] [TRT] Total Activation Memory: 449576960\n", - "[07/25/2022-16:43:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2512, GPU 6760 (MiB)\n", - "[07/25/2022-16:43:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2513, GPU 6770 (MiB)\n", - "[07/25/2022-16:43:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +9, now: CPU 0, GPU 9 (MiB)\n", - "[07/25/2022-16:43:21] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2521, GPU 6724 (MiB)\n", - "[07/25/2022-16:43:21] [I] [TRT] Loaded engine size: 10 MiB\n", - "[07/25/2022-16:43:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2522, GPU 6746 (MiB)\n", - "[07/25/2022-16:43:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2523, GPU 6754 (MiB)\n", - "[07/25/2022-16:43:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8, now: CPU 0, GPU 8 (MiB)\n", - "[07/25/2022-16:43:22] [I] Engine built in 59.1433 sec.\n", - "[07/25/2022-16:43:22] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2289, GPU 6696 (MiB)\n", - "[07/25/2022-16:43:22] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2290, GPU 6704 
(MiB)\n", - "[07/25/2022-16:43:22] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +438, now: CPU 0, GPU 446 (MiB)\n", - "[07/25/2022-16:43:22] [I] Using random values for input input.1\n", - "[07/25/2022-16:43:22] [I] Created input binding for input.1 with dimensions 64x3x224x224\n", - "[07/25/2022-16:43:22] [I] Using random values for output 536\n", - "[07/25/2022-16:43:22] [I] Created output binding for 536 with dimensions 64x10\n", - "[07/25/2022-16:43:22] [I] Starting inference\n", - "[07/25/2022-16:43:25] [I] Warmup completed 34 queries over 200 ms\n", - "[07/25/2022-16:43:25] [I] Timing trace has 501 queries over 3.01732 s\n", - "[07/25/2022-16:43:25] [I] \n", - "[07/25/2022-16:43:25] [I] === Trace details ===\n", - "[07/25/2022-16:43:25] [I] Trace averages of 10 runs:\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.88872 ms - Host latency: 8.93236 ms (end to end 11.4268 ms, enqueue 2.05089 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.88841 ms - Host latency: 8.92661 ms (end to end 11.4266 ms, enqueue 2.07079 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.88647 ms - Host latency: 8.93378 ms (end to end 11.4245 ms, enqueue 2.07513 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.8879 ms - Host latency: 8.93474 ms (end to end 11.4218 ms, enqueue 2.04516 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.88708 ms - Host latency: 8.92472 ms (end to end 11.2913 ms, enqueue 2.04477 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.88964 ms - Host latency: 8.93073 ms (end to end 11.4241 ms, enqueue 2.04273 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.89016 ms - Host latency: 8.92474 ms (end to end 11.4283 ms, enqueue 2.04633 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.88841 ms - Host latency: 8.92583 ms (end to end 
11.4307 ms, enqueue 2.05944 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.88973 ms - Host latency: 8.92712 ms (end to end 11.4225 ms, enqueue 2.06941 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.88892 ms - Host latency: 8.92521 ms (end to end 11.4224 ms, enqueue 2.05708 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.92097 ms - Host latency: 8.96465 ms (end to end 11.4841 ms, enqueue 2.04125 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.09852 ms - Host latency: 9.13358 ms (end to end 11.7906 ms, enqueue 2.04748 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.44015 ms - Host latency: 9.47874 ms (end to end 12.5498 ms, enqueue 2.05565 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.25358 ms - Host latency: 9.28981 ms (end to end 12.1605 ms, enqueue 2.05262 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.14546 ms - Host latency: 9.18715 ms (end to end 11.9508 ms, enqueue 2.06964 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.02576 ms - Host latency: 9.06241 ms (end to end 11.7147 ms, enqueue 2.04923 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.92704 ms - Host latency: 8.96814 ms (end to end 11.5024 ms, enqueue 2.04821 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.01957 ms - Host latency: 9.05573 ms (end to end 11.6706 ms, enqueue 2.04988 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.94579 ms - Host latency: 8.98406 ms (end to end 11.5354 ms, enqueue 2.13973 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.89835 ms - Host latency: 8.94883 ms (end to end 11.4496 ms, enqueue 2.08344 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.99513 ms - Host latency: 9.03672 ms (end to end 11.6076 ms, enqueue 2.0929 ms)\n", - "[07/25/2022-16:43:25] 
[I] Average on 10 runs - GPU latency: 6.08859 ms - Host latency: 9.12035 ms (end to end 11.8224 ms, enqueue 2.06177 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.9467 ms - Host latency: 8.987 ms (end to end 11.5444 ms, enqueue 2.06372 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.89579 ms - Host latency: 8.93199 ms (end to end 11.4334 ms, enqueue 2.04498 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.91914 ms - Host latency: 8.95847 ms (end to end 11.4744 ms, enqueue 2.06753 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.92528 ms - Host latency: 8.96528 ms (end to end 11.4935 ms, enqueue 2.05543 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.92607 ms - Host latency: 8.96593 ms (end to end 11.4996 ms, enqueue 2.05464 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.11134 ms - Host latency: 9.14991 ms (end to end 11.8276 ms, enqueue 2.06058 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.24971 ms - Host latency: 9.2879 ms (end to end 12.1685 ms, enqueue 2.05168 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.1583 ms - Host latency: 9.19552 ms (end to end 11.9784 ms, enqueue 2.05416 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.03793 ms - Host latency: 9.07539 ms (end to end 11.7194 ms, enqueue 2.04376 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.03723 ms - Host latency: 9.07742 ms (end to end 11.7207 ms, enqueue 2.04446 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.16055 ms - Host latency: 9.1936 ms (end to end 11.9269 ms, enqueue 2.06987 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.24443 ms - Host latency: 9.28486 ms (end to end 12.1531 ms, enqueue 2.04836 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.94749 ms - Host latency: 
8.98728 ms (end to end 11.5623 ms, enqueue 2.05354 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.91284 ms - Host latency: 8.95781 ms (end to end 11.4716 ms, enqueue 2.04207 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.98567 ms - Host latency: 9.02083 ms (end to end 11.6108 ms, enqueue 2.04358 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.93113 ms - Host latency: 8.97266 ms (end to end 11.533 ms, enqueue 2.06318 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.89543 ms - Host latency: 8.92844 ms (end to end 11.4434 ms, enqueue 2.05273 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.95349 ms - Host latency: 8.99211 ms (end to end 11.5469 ms, enqueue 2.07312 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.94853 ms - Host latency: 8.98025 ms (end to end 11.5569 ms, enqueue 2.04573 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.97844 ms - Host latency: 9.01548 ms (end to end 11.6017 ms, enqueue 2.05762 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 5.96038 ms - Host latency: 9.00027 ms (end to end 11.5838 ms, enqueue 2.04302 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.03623 ms - Host latency: 9.07041 ms (end to end 11.7005 ms, enqueue 2.05886 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.08901 ms - Host latency: 9.12502 ms (end to end 11.8232 ms, enqueue 2.06831 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.07283 ms - Host latency: 9.11433 ms (end to end 11.8008 ms, enqueue 2.07654 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.10923 ms - Host latency: 9.14961 ms (end to end 11.8509 ms, enqueue 2.05337 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.05793 ms - Host latency: 9.09639 ms (end to end 11.776 ms, enqueue 2.06641 ms)\n", - 
"[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.1678 ms - Host latency: 9.20984 ms (end to end 11.9751 ms, enqueue 2.05627 ms)\n", - "[07/25/2022-16:43:25] [I] Average on 10 runs - GPU latency: 6.05857 ms - Host latency: 9.0998 ms (end to end 11.7799 ms, enqueue 2.06199 ms)\n", - "[07/25/2022-16:43:25] [I] \n", - "[07/25/2022-16:43:25] [I] === Performance summary ===\n", - "[07/25/2022-16:43:25] [I] Throughput: 166.041 qps\n", - "[07/25/2022-16:43:25] [I] Latency: min = 8.90146 ms, max = 9.52582 ms, mean = 9.04623 ms, median = 8.99896 ms, percentile(99%) = 9.50714 ms\n", - "[07/25/2022-16:43:25] [I] End-to-End Host Latency: min = 10.1021 ms, max = 12.6202 ms, mean = 11.6584 ms, median = 11.5563 ms, percentile(99%) = 12.5932 ms\n", - "[07/25/2022-16:43:25] [I] Enqueue Time: min = 1.93103 ms, max = 2.48816 ms, mean = 2.05872 ms, median = 2.05432 ms, percentile(99%) = 2.24194 ms\n", - "[07/25/2022-16:43:25] [I] H2D Latency: min = 3.00195 ms, max = 3.14062 ms, mean = 3.03002 ms, median = 3.02588 ms, percentile(99%) = 3.08609 ms\n", - "[07/25/2022-16:43:25] [I] GPU Compute Time: min = 5.87982 ms, max = 6.47681 ms, mean = 6.00728 ms, median = 5.94946 ms, percentile(99%) = 6.47375 ms\n", - "[07/25/2022-16:43:25] [I] D2H Latency: min = 0.00708008 ms, max = 0.0134277 ms, mean = 0.00893093 ms, median = 0.00878906 ms, percentile(99%) = 0.0117188 ms\n", - "[07/25/2022-16:43:25] [I] Total Host Walltime: 3.01732 s\n", - "[07/25/2022-16:43:25] [I] Total GPU Compute Time: 3.00965 s\n", - "[07/25/2022-16:43:25] [I] Explanations of the performance metrics are printed in the verbose logs.\n", - "[07/25/2022-16:43:25] [I] \n", - "&&&& PASSED TensorRT.trtexec [TensorRT v8205] # trtexec --onnx=models/mobilenetv2_base.onnx --saveEngine=models/mobilenetv2_base.trt\n" - ] - } - ], - "source": [ - "# Exporting to Onnx\n", - "dummy_input = torch.randn(64, 3, 224, 224, device='cuda')\n", - "input_names = [ \"actual_input_1\" ]\n", - "output_names = [ \"output1\" ]\n", - 
"torch.onnx.export(\n", - " model,\n", - " dummy_input,\n", - " \"models/mobilenetv2_base.onnx\",\n", - " verbose=False,\n", - " opset_version=13,\n", - " do_constant_folding = False)\n", - "\n", - "# Converting ONNX model to TRT\n", - "!trtexec --onnx=models/mobilenetv2_base.onnx --saveEngine=models/mobilenetv2_base.trt" - ] - }, - { - "cell_type": "markdown", - "id": "0a079b97", - "metadata": {}, - "source": [ - "\n", - "## 4. Post Training Quantization (PTQ)" - ] - }, - { - "attachments": { - "img4.JPG": { - "image/jpeg": "" - } - }, - "cell_type": "markdown", - "id": "bf3d4397", - "metadata": {}, - "source": [ - "As the name suggests, PTQ is performed on a trained model that has achieved acceptable accuracy. It is effective and also quick to implement because it does not require any retraining of the network. Now that we have the trained checkpoint ready, let's start quantizing the model. \n", - "\n", - "To perform PTQ, we perform inference in FP32 on calibration data, a subset of training or validation data, to determine the range of representable FP32 values to be quantized. This gives us the scale that can be used to map the values to the quantized range. We call this process of choosing the input range \"Calibration\". The three popular techniques used to calibrate are:\n", - "\n", - "- Min-Max: Use the minimum and maximum of the FP32 values seen during calibration. The disadvantage with this method is that, if there is an outlier, our mapping can induce a larger rounding error. \n", - "\n", - "- Entropy: Not all values in the FP32 tensor may be equally important. Hence using cross entropy with different range values [T1, T2], we try to minimize the information loss between the original FP32 tensor and quantized tensor. \n", - "\n", - "- Percentile: Use the percentile of the distribution of absolute values seen during calibration. 
For example, with 99% calibration, we clip the 1% of values with the largest magnitude and use [P1, P2] as the representable range to be quantized.\n", - "\n", - "\n", - "![img4.JPG](attachment:img4.JPG)\n", - "\n", - "\n", - "We will be using the PyTorch Quantization toolkit, a toolkit built for training and evaluating PyTorch models with simulated quantization. \n", - "\n", - "`quant_modules.initialize()` will ensure quantized modules are called instead of original modules. For example, when you define a model with convolution, linear and pooling layers, you will make a call to `QuantConv2d`, `QuantLinear` and `QuantPooling` respectively. `QuantConv2d` basically wraps quantizer nodes around inputs and weights of regular `Conv2d`. Please refer to the list of quantized modules in the pytorch-quantization toolkit for more information. " - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "f1520afc", - "metadata": {}, - "outputs": [], - "source": [ - "quant_modules.initialize()" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "ee09402f", - "metadata": {}, - "outputs": [], - "source": [ - "# We define Mobilenetv2 again just like we did above\n", - "# All the regular conv, FC layers will be converted to their quantized counterparts due to quant_modules.initialize()\n", - "feature_extract = True\n", - "q_model = models.mobilenet_v2(pretrained=True)\n", - "set_parameter_requires_grad(q_model, feature_extract)\n", - "q_model.classifier[1] = nn.Linear(1280, 10)\n", - "q_model = q_model.cuda()\n", - "\n", - "# mobilenetv2_base_ckpt is the checkpoint generated from Step 2 : Training a baseline Mobilenetv2 model.\n", - "ckpt = torch.load(\"./models/mobilenetv2_base_ckpt\")\n", - "modified_state_dict={}\n", - "for key, val in ckpt[\"model_state_dict\"].items():\n", - " # Remove 'module.' 
from the key names\n", - " if key.startswith('module'):\n", - " modified_state_dict[key[7:]] = val\n", - " else:\n", - " modified_state_dict[key] = val\n", - "\n", - "# Load the pre-trained checkpoint\n", - "q_model.load_state_dict(modified_state_dict)\n", - "optimizer.load_state_dict(ckpt[\"opt_state_dict\"])" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "b8726956", - "metadata": {}, - "outputs": [], - "source": [ - "def compute_amax(model, **kwargs):\n", - " # Load calib result\n", - " for name, module in model.named_modules():\n", - " if isinstance(module, quant_nn.TensorQuantizer):\n", - " if module._calibrator is not None:\n", - " if isinstance(module._calibrator, calib.MaxCalibrator):\n", - " module.load_calib_amax()\n", - " else:\n", - " module.load_calib_amax(**kwargs)\n", - " model.cuda()\n", - "\n", - "def collect_stats(model, data_loader, num_batches):\n", - " \"\"\"Feed data to the network and collect statistics\"\"\"\n", - " # Enable calibrators\n", - " for name, module in model.named_modules():\n", - " if isinstance(module, quant_nn.TensorQuantizer):\n", - " if module._calibrator is not None:\n", - " module.disable_quant()\n", - " module.enable_calib()\n", - " else:\n", - " module.disable()\n", - "\n", - " # Feed data to the network for collecting stats\n", - " for i, (image, _) in tqdm(enumerate(data_loader), total=num_batches):\n", - " model(image.cuda())\n", - " if i >= num_batches:\n", - " break\n", - "\n", - " # Disable calibrators\n", - " for name, module in model.named_modules():\n", - " if isinstance(module, quant_nn.TensorQuantizer):\n", - " if module._calibrator is not None:\n", - " module.enable_quant()\n", - " module.disable_calib()\n", - " else:\n", - " module.enable()" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "da627181", - "metadata": {}, - "outputs": [], - "source": [ - "# Calibrate the model using max calibration technique.\n", - "with torch.no_grad():\n", - " collect_stats(q_model, 
train_dataloader, num_batches=16)\n", - " compute_amax(q_model, method=\"max\")" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "73e6d51c", - "metadata": {}, - "outputs": [], - "source": [ - "# Save the PTQ model\n", - "torch.save(q_model.state_dict(), \"./models/mobilenetv2_ptq.pth\")" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "c7dadbf2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Mobilenetv2 PTQ accuracy: 68.11%\n" - ] - } - ], - "source": [ - "# Evaluate the PTQ Model \n", - "test_acc = evaluate(q_model, val_dataloader, criterion, 0)\n", - "print(\"Mobilenetv2 PTQ accuracy: {:.2f}%\".format(100 * test_acc))" - ] - }, - { - "cell_type": "markdown", - "id": "efd5ff11", - "metadata": {}, - "source": [ - "Let us now prepare this model for export to ONNX. Setting `quant_nn.TensorQuantizer.use_fb_fake_quant = True` makes the quantized model use the `torch.fake_quantize_per_tensor_affine` and `torch.fake_quantize_per_channel_affine` operators instead of the toolkit's `tensor_quant` function, so that the quantization operators can be exported. " - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "3f10f707", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "W0725 16:43:50.537823 139848660895552 tensor_quantizer.py:280] Use Pytorch's native experimental fake quantization.\n", - "/opt/conda/lib/python3.8/site-packages/pytorch_quantization/nn/modules/tensor_quantizer.py:283: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n", - " if amax.numel() == 1:\n", - "/opt/conda/lib/python3.8/site-packages/pytorch_quantization/nn/modules/tensor_quantizer.py:285: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", - " inputs, amax.item() / bound, 0,\n", - "/opt/conda/lib/python3.8/site-packages/pytorch_quantization/nn/modules/tensor_quantizer.py:291: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", - " quant_dim = list(amax.shape).index(list(amax_sequeeze.shape)[0])\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "&&&& RUNNING TensorRT.trtexec [TensorRT v8205] # trtexec --onnx=models/mobilenetv2_ptq.onnx --int8 --saveEngine=models/mobilenetv2_ptq.trt\n", - "[07/25/2022-16:43:56] [I] === Model Options ===\n", - "[07/25/2022-16:43:56] [I] Format: ONNX\n", - "[07/25/2022-16:43:56] [I] Model: models/mobilenetv2_ptq.onnx\n", - "[07/25/2022-16:43:56] [I] Output:\n", - "[07/25/2022-16:43:56] [I] === Build Options ===\n", - "[07/25/2022-16:43:56] [I] Max batch: explicit batch\n", - "[07/25/2022-16:43:56] [I] Workspace: 16 MiB\n", - "[07/25/2022-16:43:56] [I] minTiming: 1\n", - "[07/25/2022-16:43:56] [I] avgTiming: 8\n", - "[07/25/2022-16:43:56] [I] Precision: FP32+INT8\n", - "[07/25/2022-16:43:56] [I] Calibration: Dynamic\n", - "[07/25/2022-16:43:56] [I] Refit: Disabled\n", - "[07/25/2022-16:43:56] [I] Sparsity: Disabled\n", - "[07/25/2022-16:43:56] [I] Safe mode: Disabled\n", - "[07/25/2022-16:43:56] [I] DirectIO mode: Disabled\n", - "[07/25/2022-16:43:56] [I] Restricted mode: 
Disabled\n", - "[07/25/2022-16:43:56] [I] Save engine: models/mobilenetv2_ptq.trt\n", - "[07/25/2022-16:43:56] [I] Load engine: \n", - "[07/25/2022-16:43:56] [I] Profiling verbosity: 0\n", - "[07/25/2022-16:43:56] [I] Tactic sources: Using default tactic sources\n", - "[07/25/2022-16:43:56] [I] timingCacheMode: local\n", - "[07/25/2022-16:43:56] [I] timingCacheFile: \n", - "[07/25/2022-16:43:56] [I] Input(s)s format: fp32:CHW\n", - "[07/25/2022-16:43:56] [I] Output(s)s format: fp32:CHW\n", - "[07/25/2022-16:43:56] [I] Input build shapes: model\n", - "[07/25/2022-16:43:56] [I] Input calibration shapes: model\n", - "[07/25/2022-16:43:56] [I] === System Options ===\n", - "[07/25/2022-16:43:56] [I] Device: 0\n", - "[07/25/2022-16:43:56] [I] DLACore: \n", - "[07/25/2022-16:43:56] [I] Plugins:\n", - "[07/25/2022-16:43:56] [I] === Inference Options ===\n", - "[07/25/2022-16:43:56] [I] Batch: Explicit\n", - "[07/25/2022-16:43:56] [I] Input inference shapes: model\n", - "[07/25/2022-16:43:56] [I] Iterations: 10\n", - "[07/25/2022-16:43:56] [I] Duration: 3s (+ 200ms warm up)\n", - "[07/25/2022-16:43:56] [I] Sleep time: 0ms\n", - "[07/25/2022-16:43:56] [I] Idle time: 0ms\n", - "[07/25/2022-16:43:56] [I] Streams: 1\n", - "[07/25/2022-16:43:56] [I] ExposeDMA: Disabled\n", - "[07/25/2022-16:43:56] [I] Data transfers: Enabled\n", - "[07/25/2022-16:43:56] [I] Spin-wait: Disabled\n", - "[07/25/2022-16:43:56] [I] Multithreading: Disabled\n", - "[07/25/2022-16:43:56] [I] CUDA Graph: Disabled\n", - "[07/25/2022-16:43:56] [I] Separate profiling: Disabled\n", - "[07/25/2022-16:43:56] [I] Time Deserialize: Disabled\n", - "[07/25/2022-16:43:56] [I] Time Refit: Disabled\n", - "[07/25/2022-16:43:56] [I] Skip inference: Disabled\n", - "[07/25/2022-16:43:56] [I] Inputs:\n", - "[07/25/2022-16:43:56] [I] === Reporting Options ===\n", - "[07/25/2022-16:43:56] [I] Verbose: Disabled\n", - "[07/25/2022-16:43:56] [I] Averages: 10 inferences\n", - "[07/25/2022-16:43:56] [I] Percentile: 99\n", - 
"[07/25/2022-16:43:56] [I] Dump refittable layers:Disabled\n", - "[07/25/2022-16:43:56] [I] Dump output: Disabled\n", - "[07/25/2022-16:43:56] [I] Profile: Disabled\n", - "[07/25/2022-16:43:56] [I] Export timing to JSON file: \n", - "[07/25/2022-16:43:56] [I] Export output to JSON file: \n", - "[07/25/2022-16:43:56] [I] Export profile to JSON file: \n", - "[07/25/2022-16:43:56] [I] \n", - "[07/25/2022-16:43:56] [I] === Device Information ===\n", - "[07/25/2022-16:43:56] [I] Selected Device: NVIDIA Graphics Device\n", - "[07/25/2022-16:43:56] [I] Compute Capability: 8.0\n", - "[07/25/2022-16:43:56] [I] SMs: 124\n", - "[07/25/2022-16:43:56] [I] Compute Clock Rate: 1.005 GHz\n", - "[07/25/2022-16:43:56] [I] Device Global Memory: 47681 MiB\n", - "[07/25/2022-16:43:56] [I] Shared Memory per SM: 164 KiB\n", - "[07/25/2022-16:43:56] [I] Memory Bus Width: 6144 bits (ECC enabled)\n", - "[07/25/2022-16:43:56] [I] Memory Clock Rate: 1.215 GHz\n", - "[07/25/2022-16:43:56] [I] \n", - "[07/25/2022-16:43:56] [I] TensorRT version: 8.2.5\n", - "[07/25/2022-16:43:57] [I] [TRT] [MemUsageChange] Init CUDA: CPU +440, GPU +0, now: CPU 452, GPU 5862 (MiB)\n", - "[07/25/2022-16:43:57] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 452 MiB, GPU 5862 MiB\n", - "[07/25/2022-16:43:57] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 669 MiB, GPU 5934 MiB\n", - "[07/25/2022-16:43:57] [I] Start parsing network model\n", - "[07/25/2022-16:43:57] [I] [TRT] ----------------------------------------------------------------\n", - "[07/25/2022-16:43:57] [I] [TRT] Input filename: models/mobilenetv2_ptq.onnx\n", - "[07/25/2022-16:43:57] [I] [TRT] ONNX IR version: 0.0.7\n", - "[07/25/2022-16:43:57] [I] [TRT] Opset version: 13\n", - "[07/25/2022-16:43:57] [I] [TRT] Producer name: pytorch\n", - "[07/25/2022-16:43:57] [I] [TRT] Producer version: 1.13.0\n", - "[07/25/2022-16:43:57] [I] [TRT] Domain: \n", - "[07/25/2022-16:43:57] [I] [TRT] Model version: 
0\n", - "[07/25/2022-16:43:57] [I] [TRT] Doc string: \n", - "[07/25/2022-16:43:57] [I] [TRT] ----------------------------------------------------------------\n", - "[07/25/2022-16:43:57] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:506: Your ONNX model has been generated with double-typed weights, while TensorRT does not natively support double. Attempting to cast down to float.\n", - "[07/25/2022-16:43:57] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:368: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.\n", - "[07/25/2022-16:43:57] [I] Finish parsing network model\n", - "[07/25/2022-16:43:57] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best\n", - "[07/25/2022-16:43:58] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.\n", - "[07/25/2022-16:43:59] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +838, GPU +362, now: CPU 1543, GPU 6342 (MiB)\n", - "[07/25/2022-16:43:59] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +128, GPU +58, now: CPU 1671, GPU 6400 (MiB)\n", - "[07/25/2022-16:43:59] [I] [TRT] Local timing cache in use. 
Profiling results in this builder pass will not be stored.\n", - "[07/25/2022-16:44:20] [I] [TRT] Detected 1 inputs and 1 output network tensors.\n", - "[07/25/2022-16:44:21] [I] [TRT] Total Host Persistent Memory: 75056\n", - "[07/25/2022-16:44:21] [I] [TRT] Total Device Persistent Memory: 2367488\n", - "[07/25/2022-16:44:21] [I] [TRT] Total Scratch Memory: 0\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 11 MiB, GPU 184 MiB\n", - "[07/25/2022-16:44:21] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 3.69334ms to assign 4 blocks to 87 nodes requiring 131661824 bytes.\n", - "[07/25/2022-16:44:21] [I] [TRT] Total Activation Memory: 131661824\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1674, GPU 6412 (MiB)\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1674, GPU 6422 (MiB)\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +2, GPU +4, now: CPU 2, GPU 4 (MiB)\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1665, GPU 6384 (MiB)\n", - "[07/25/2022-16:44:21] [I] [TRT] Loaded engine size: 2 MiB\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1666, GPU 6398 (MiB)\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1666, GPU 6406 (MiB)\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2, now: CPU 0, GPU 2 (MiB)\n", - "[07/25/2022-16:44:21] [I] Engine built in 24.535 sec.\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1435, GPU 6312 (MiB)\n", - "[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1436, GPU 6320 (MiB)\n", - 
"[07/25/2022-16:44:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +128, now: CPU 0, GPU 130 (MiB)\n", - "[07/25/2022-16:44:21] [I] Using random values for input inputs.1\n", - "[07/25/2022-16:44:21] [I] Created input binding for inputs.1 with dimensions 64x3x224x224\n", - "[07/25/2022-16:44:21] [I] Using random values for output 1225\n", - "[07/25/2022-16:44:21] [I] Created output binding for 1225 with dimensions 64x10\n", - "[07/25/2022-16:44:21] [I] Starting inference\n", - "[07/25/2022-16:44:24] [I] Warmup completed 64 queries over 200 ms\n", - "[07/25/2022-16:44:24] [I] Timing trace has 967 queries over 3.00851 s\n", - "[07/25/2022-16:44:24] [I] \n", - "[07/25/2022-16:44:24] [I] === Trace details ===\n", - "[07/25/2022-16:44:24] [I] Trace averages of 10 runs:\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.62079 ms - Host latency: 4.67811 ms (end to end 4.69463 ms, enqueue 1.64643 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58894 ms - Host latency: 4.64457 ms (end to end 4.66023 ms, enqueue 1.64765 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58884 ms - Host latency: 4.68113 ms (end to end 4.69763 ms, enqueue 1.4498 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59182 ms - Host latency: 4.74732 ms (end to end 4.76547 ms, enqueue 1.01564 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.57819 ms - Host latency: 4.72507 ms (end to end 4.74366 ms, enqueue 1.02484 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.57656 ms - Host latency: 4.72242 ms (end to end 4.74165 ms, enqueue 1.02861 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.57644 ms - Host latency: 4.71519 ms (end to end 4.7332 ms, enqueue 1.01613 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58525 ms - Host latency: 4.71598 ms (end to end 4.73434 ms, 
enqueue 1.02659 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58402 ms - Host latency: 4.73148 ms (end to end 4.74992 ms, enqueue 1.01769 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.5875 ms - Host latency: 4.73852 ms (end to end 4.75818 ms, enqueue 1.01811 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58987 ms - Host latency: 4.73746 ms (end to end 4.75689 ms, enqueue 1.03277 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58997 ms - Host latency: 4.7413 ms (end to end 4.75951 ms, enqueue 1.01619 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58946 ms - Host latency: 4.72262 ms (end to end 4.74041 ms, enqueue 1.02238 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58925 ms - Host latency: 4.73135 ms (end to end 4.74933 ms, enqueue 1.01594 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59028 ms - Host latency: 4.73451 ms (end to end 4.75285 ms, enqueue 1.02201 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58268 ms - Host latency: 4.73112 ms (end to end 4.74874 ms, enqueue 1.02508 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58301 ms - Host latency: 4.72178 ms (end to end 4.74047 ms, enqueue 1.01762 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58886 ms - Host latency: 4.65172 ms (end to end 4.66926 ms, enqueue 1.51528 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58914 ms - Host latency: 4.64406 ms (end to end 4.65896 ms, enqueue 1.63688 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58699 ms - Host latency: 4.64383 ms (end to end 4.65996 ms, enqueue 1.65472 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58555 ms - Host latency: 4.64166 ms (end to end 4.65729 ms, enqueue 1.63208 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 
10 runs - GPU latency: 1.5912 ms - Host latency: 4.70112 ms (end to end 4.71844 ms, enqueue 1.32826 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59344 ms - Host latency: 4.73959 ms (end to end 4.75857 ms, enqueue 1.02899 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58987 ms - Host latency: 4.73836 ms (end to end 4.75505 ms, enqueue 1.01709 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59006 ms - Host latency: 4.73572 ms (end to end 4.75276 ms, enqueue 1.02136 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59048 ms - Host latency: 4.71992 ms (end to end 4.73885 ms, enqueue 1.02228 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59038 ms - Host latency: 4.70565 ms (end to end 4.72057 ms, enqueue 1.07745 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59613 ms - Host latency: 4.654 ms (end to end 4.66982 ms, enqueue 1.64631 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60891 ms - Host latency: 4.6658 ms (end to end 4.68058 ms, enqueue 1.64453 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.63901 ms - Host latency: 4.72241 ms (end to end 4.74214 ms, enqueue 1.34059 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.62897 ms - Host latency: 4.68709 ms (end to end 4.69999 ms, enqueue 1.66216 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.63082 ms - Host latency: 4.70751 ms (end to end 4.7218 ms, enqueue 1.45334 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.62992 ms - Host latency: 4.6874 ms (end to end 4.70267 ms, enqueue 1.64911 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.622 ms - Host latency: 4.73571 ms (end to end 4.75325 ms, enqueue 1.20652 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59867 ms - Host latency: 4.6564 ms (end to 
end 4.67043 ms, enqueue 1.59722 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60093 ms - Host latency: 4.65856 ms (end to end 4.67501 ms, enqueue 1.66334 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60172 ms - Host latency: 4.72034 ms (end to end 4.73595 ms, enqueue 1.27314 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60275 ms - Host latency: 4.7422 ms (end to end 4.76001 ms, enqueue 1.03055 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60154 ms - Host latency: 4.75237 ms (end to end 4.76968 ms, enqueue 1.01521 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60103 ms - Host latency: 4.65785 ms (end to end 4.67402 ms, enqueue 1.57283 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59702 ms - Host latency: 4.65447 ms (end to end 4.66899 ms, enqueue 1.6537 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60101 ms - Host latency: 4.66719 ms (end to end 4.68365 ms, enqueue 1.57606 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60033 ms - Host latency: 4.66338 ms (end to end 4.67982 ms, enqueue 1.52695 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.61044 ms - Host latency: 4.6709 ms (end to end 4.68477 ms, enqueue 1.65308 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.61915 ms - Host latency: 4.75122 ms (end to end 4.76687 ms, enqueue 1.15017 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60728 ms - Host latency: 4.74371 ms (end to end 4.76132 ms, enqueue 1.03044 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59255 ms - Host latency: 4.72791 ms (end to end 4.74779 ms, enqueue 1.03347 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59315 ms - Host latency: 4.74182 ms (end to end 4.75947 ms, enqueue 1.01835 ms)\n", - "[07/25/2022-16:44:24] 
[I] Average on 10 runs - GPU latency: 1.59058 ms - Host latency: 4.73859 ms (end to end 4.75806 ms, enqueue 1.01575 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59283 ms - Host latency: 4.73408 ms (end to end 4.75116 ms, enqueue 1.02853 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59253 ms - Host latency: 4.73284 ms (end to end 4.7496 ms, enqueue 1.0173 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59139 ms - Host latency: 4.73563 ms (end to end 4.7526 ms, enqueue 1.01703 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59039 ms - Host latency: 4.68552 ms (end to end 4.70142 ms, enqueue 1.15013 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58688 ms - Host latency: 4.64351 ms (end to end 4.65852 ms, enqueue 1.65355 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59246 ms - Host latency: 4.78259 ms (end to end 4.79854 ms, enqueue 0.765063 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58976 ms - Host latency: 4.79293 ms (end to end 4.80812 ms, enqueue 0.447778 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59243 ms - Host latency: 4.72633 ms (end to end 4.74291 ms, enqueue 1.14955 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58927 ms - Host latency: 4.70409 ms (end to end 4.71877 ms, enqueue 1.46211 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58922 ms - Host latency: 4.69727 ms (end to end 4.71404 ms, enqueue 1.4674 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60126 ms - Host latency: 4.71378 ms (end to end 4.72882 ms, enqueue 1.4665 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.6146 ms - Host latency: 4.72861 ms (end to end 4.74229 ms, enqueue 1.46687 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.61904 ms - Host latency: 
4.73428 ms (end to end 4.75139 ms, enqueue 1.45776 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.6167 ms - Host latency: 4.72507 ms (end to end 4.7394 ms, enqueue 1.46343 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.6165 ms - Host latency: 4.72825 ms (end to end 4.74551 ms, enqueue 1.48093 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.61758 ms - Host latency: 4.72815 ms (end to end 4.74431 ms, enqueue 1.47295 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60859 ms - Host latency: 4.727 ms (end to end 4.74077 ms, enqueue 1.45435 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.6021 ms - Host latency: 4.71687 ms (end to end 4.73274 ms, enqueue 1.45869 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.61111 ms - Host latency: 4.72588 ms (end to end 4.73958 ms, enqueue 1.46362 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.6085 ms - Host latency: 4.71299 ms (end to end 4.72961 ms, enqueue 1.4863 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.62021 ms - Host latency: 4.73657 ms (end to end 4.75117 ms, enqueue 1.46689 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.61001 ms - Host latency: 4.7217 ms (end to end 4.73774 ms, enqueue 1.47329 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60894 ms - Host latency: 4.72175 ms (end to end 4.73774 ms, enqueue 1.45996 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59602 ms - Host latency: 4.69124 ms (end to end 4.70601 ms, enqueue 1.48582 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58879 ms - Host latency: 4.7061 ms (end to end 4.72107 ms, enqueue 1.45811 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59341 ms - Host latency: 4.7093 ms (end to end 4.72632 ms, enqueue 1.46155 ms)\n", - 
"[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59861 ms - Host latency: 4.67756 ms (end to end 4.69421 ms, enqueue 1.54897 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59929 ms - Host latency: 4.65381 ms (end to end 4.66875 ms, enqueue 1.64392 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60144 ms - Host latency: 4.71389 ms (end to end 4.73044 ms, enqueue 1.3313 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59775 ms - Host latency: 4.71812 ms (end to end 4.73245 ms, enqueue 1.02263 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.57788 ms - Host latency: 4.66929 ms (end to end 4.68704 ms, enqueue 1.26707 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.58318 ms - Host latency: 4.64211 ms (end to end 4.6571 ms, enqueue 1.6553 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59839 ms - Host latency: 4.6543 ms (end to end 4.66938 ms, enqueue 1.65542 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59526 ms - Host latency: 4.66873 ms (end to end 4.68474 ms, enqueue 1.57432 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60022 ms - Host latency: 4.74575 ms (end to end 4.76467 ms, enqueue 1.02512 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59861 ms - Host latency: 4.725 ms (end to end 4.74438 ms, enqueue 1.03474 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60442 ms - Host latency: 4.74048 ms (end to end 4.75903 ms, enqueue 1.02407 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60613 ms - Host latency: 4.74568 ms (end to end 4.76294 ms, enqueue 1.02964 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60151 ms - Host latency: 4.74846 ms (end to end 4.76499 ms, enqueue 1.01465 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60273 
ms - Host latency: 4.7436 ms (end to end 4.76155 ms, enqueue 1.02131 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59956 ms - Host latency: 4.73704 ms (end to end 4.75496 ms, enqueue 1.02078 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60122 ms - Host latency: 4.74536 ms (end to end 4.76064 ms, enqueue 1.02913 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60396 ms - Host latency: 4.75247 ms (end to end 4.77131 ms, enqueue 1.0165 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60598 ms - Host latency: 4.74436 ms (end to end 4.76189 ms, enqueue 1.01392 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.59995 ms - Host latency: 4.71816 ms (end to end 4.73706 ms, enqueue 1.02988 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.60974 ms - Host latency: 4.7179 ms (end to end 4.73477 ms, enqueue 1.07554 ms)\n", - "[07/25/2022-16:44:24] [I] Average on 10 runs - GPU latency: 1.61443 ms - Host latency: 4.66958 ms (end to end 4.68594 ms, enqueue 1.65239 ms)\n", - "[07/25/2022-16:44:24] [I] \n", - "[07/25/2022-16:44:24] [I] === Performance summary ===\n", - "[07/25/2022-16:44:24] [I] Throughput: 321.422 qps\n", - "[07/25/2022-16:44:24] [I] Latency: min = 4.61383 ms, max = 5.11646 ms, mean = 4.71056 ms, median = 4.71863 ms, percentile(99%) = 4.80322 ms\n", - "[07/25/2022-16:44:24] [I] End-to-End Host Latency: min = 4.62366 ms, max = 5.13928 ms, mean = 4.72723 ms, median = 4.73462 ms, percentile(99%) = 4.81934 ms\n", - "[07/25/2022-16:44:24] [I] Enqueue Time: min = 0.337158 ms, max = 1.83459 ms, mean = 1.28084 ms, median = 1.0896 ms, percentile(99%) = 1.71924 ms\n", - "[07/25/2022-16:44:24] [I] H2D Latency: min = 3.01642 ms, max = 3.51599 ms, mean = 3.09767 ms, median = 3.10742 ms, percentile(99%) = 3.1925 ms\n", - "[07/25/2022-16:44:24] [I] GPU Compute Time: min = 1.56671 ms, max = 1.6599 ms, mean = 1.59911 ms, median = 1.59741 
ms, percentile(99%) = 1.63635 ms\n", - "[07/25/2022-16:44:24] [I] D2H Latency: min = 0.00561523 ms, max = 0.0314941 ms, mean = 0.0137833 ms, median = 0.0134277 ms, percentile(99%) = 0.0292969 ms\n", - "[07/25/2022-16:44:24] [I] Total Host Walltime: 3.00851 s\n", - "[07/25/2022-16:44:24] [I] Total GPU Compute Time: 1.54634 s\n", - "[07/25/2022-16:44:24] [W] * Throughput may be bound by host-to-device transfers for the inputs rather than GPU Compute and the GPU may be under-utilized.\n", - "[07/25/2022-16:44:24] [W] Add --noDataTransfers flag to disable data transfers.\n", - "[07/25/2022-16:44:24] [I] Explanations of the performance metrics are printed in the verbose logs.\n", - "[07/25/2022-16:44:24] [I] \n", - "&&&& PASSED TensorRT.trtexec [TensorRT v8205] # trtexec --onnx=models/mobilenetv2_ptq.onnx --int8 --saveEngine=models/mobilenetv2_ptq.trt\n" - ] - } - ], - "source": [ - "# Set static member of TensorQuantizer to use Pytorch’s own fake quantization functions\n", - "quant_nn.TensorQuantizer.use_fb_fake_quant = True\n", - "\n", - "# Exporting to ONNX\n", - "dummy_input = torch.randn(64, 3, 224, 224, device='cuda')\n", - "input_names = [ \"actual_input_1\" ]\n", - "output_names = [ \"output1\" ]\n", - "torch.onnx.export(\n", - " q_model,\n", - " dummy_input,\n", - " \"models/mobilenetv2_ptq.onnx\",\n", - " verbose=False,\n", - " opset_version=13,\n", - " do_constant_folding = False)\n", - "\n", - "# Converting ONNX model to TRT\n", - "!trtexec --onnx=models/mobilenetv2_ptq.onnx --int8 --saveEngine=models/mobilenetv2_ptq.trt" - ] - }, - { - "attachments": { - "img5.JPG": { - "image/jpeg": "" - } - }, - "cell_type": "markdown", - "id": "d3e676e7", - "metadata": {}, - "source": [ - "\n", - "## 5. Quantization Aware Training (QAT)\n", - "\n", - "PTQ resulted in a ~3% accuracy drop. 
After PTQ, a model may sometimes fail to retain its accuracy, because PTQ cannot mitigate the large quantization error induced by low-bit quantization. This can happen when the network contains sensitive layers, such as the depthwise convolutions in MobileNets, which are more susceptible to large quantization error. \n", - "\n", - "This is when we might want to consider using QAT. The idea behind QAT is simple: the quantized model can recover lost accuracy if it is trained with the quantization error present. This can be done either by training the model from scratch or by fine-tuning a pre-trained model. Whichever method you choose, the quantization error is introduced into the training loss by inserting fake-quantization operations. The operation is called “fake” because we quantize the data and immediately dequantize it, producing an approximate version of the data in which both input and output remain floating point values. This simulates the effects of quantization without changing much in the model. \n", - "In the forward pass, we fake-quantize the weights and activations and use these fake-quantized values to perform the layer operations.\n", - "\n", - "![img5.JPG](attachment:img5.JPG)\n", - "\n", - "In the backward pass, the derivative of the quantization operation is undefined at the step boundaries and zero everywhere else, so it cannot be used directly to compute gradients. To handle this, QAT uses the Straight-Through Estimator (STE), which approximates the derivative as 1 for inputs in the representable range. In effect, this estimator lets gradients pass through the operator unchanged in the backward pass. When the QAT process is done, the scales that were used to quantize the weights and activations are stored in the model and can be used for inference. 
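The forward and backward behaviour described above can be sketched in a few lines of plain Python. This is only an illustration, assuming symmetric signed quantization with a single scale derived from a calibrated `amax`; the names `fake_quantize` and `ste_grad` are hypothetical and are not part of `pytorch_quantization`, which implements the same idea on tensors.

```python
def fake_quantize(x, amax, num_bits=8):
    """Quantize x onto a signed integer grid, then immediately dequantize it.

    Illustrative sketch: symmetric quantization with one scale for the
    whole range [-amax, amax]. The result is still a Python float.
    """
    bound = 2 ** (num_bits - 1) - 1        # 127 for 8 bits
    scale = amax / bound                   # floating point step size
    q = round(x / scale)                   # quantize (Python rounds ties to even)
    q = max(-bound, min(bound, q))         # clip to the representable range
    return q * scale                       # dequantize back to floating point


def ste_grad(x, amax):
    """Straight-through estimator: pass the gradient unchanged (1.0)
    inside the representable range, block it (0.0) outside."""
    return 1.0 if -amax <= x <= amax else 0.0


if __name__ == "__main__":
    for value in (0.3, -0.7, 5.0):
        print(value, fake_quantize(value, amax=1.0), ste_grad(value, amax=1.0))
```

Values inside the range are snapped to the nearest grid point, values outside it are clipped to `amax`, and only the in-range values receive a gradient under this approximation.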
" - ] - }, - { - "cell_type": "markdown", - "id": "bcc10e0f", - "metadata": {}, - "source": [ - "Fine-tuning a QAT model is usually quick compared to the full training of the original model. For this Mobilenetv2 model, fine-tuning for 2 epochs is enough to get acceptable accuracy. \n", - "\n", - "The `tensor_quant` function in the `pytorch_quantization` toolkit performs the tensor quantization described above. Per-channel quantization is usually recommended for weights, while per-tensor quantization is recommended for activations.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "dc144132", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Epoch: [ 1 / 2] LR: 0.000100\n", - "Batch: [ 100 | 147] loss: 1.806\n", - "Test Acc: 69.88%\n", - "Epoch: [ 2 / 2] LR: 0.000100\n", - "Batch: [ 100 | 147] loss: 1.800\n", - "Test Acc: 69.49%\n", - "Checkpoint saved\n" - ] - } - ], - "source": [ - "# Finetune the QAT model for 2 epochs\n", - "num_epochs=2\n", - "\n", - "for epoch in range(num_epochs):\n", - " print('Epoch: [%5d / %5d] LR: %f' % (epoch + 1, num_epochs, lr))\n", - "\n", - " train(q_model, train_dataloader, criterion, optimizer, epoch)\n", - " test_acc = evaluate(q_model, val_dataloader, criterion, epoch)\n", - "\n", - " print(\"Test Acc: {:.2f}%\".format(100 * test_acc))\n", - " \n", - "save_checkpoint({'epoch': epoch + 1,\n", - " 'model_state_dict': q_model.state_dict(),\n", - " 'acc': test_acc,\n", - " 'opt_state_dict': optimizer.state_dict()\n", - " },\n", - " ckpt_path=\"models/mobilenetv2_qat_ckpt\")" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "0d801c67", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Mobilenetv2 QAT accuracy: 69.49%\n" - ] - } - ], - "source": [ - "# Evaluate the QAT model\n", - "test_acc = evaluate(q_model, val_dataloader, criterion, 0)\n", - "print(\"Mobilenetv2 QAT accuracy: 
{:.2f}%\".format(100 * test_acc))" - ] - }, - { - "cell_type": "markdown", - "id": "70bdaeed", - "metadata": {}, - "source": [ - "As you can see, fine-tuning recovered ~1.3% of accuracy. Fine-tuning for more epochs with learning rate annealing can improve accuracy further. Note that the same fine-tuning schedule would improve the accuracy of the unquantized model as well. Please refer to *Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT* for detailed recommendations.\n", - "\n", - "During inference, we use `torch.fake_quantize_per_tensor_affine` and `torch.fake_quantize_per_channel_affine` to perform quantization, as these operations are easier to convert into the corresponding TensorRT operators.\n", - "\n", - "Let us now prepare this model for export to ONNX." - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "176a6bfd", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "&&&& RUNNING TensorRT.trtexec [TensorRT v8205] # trtexec --onnx=models/mobilenetv2_qat.onnx --int8 --saveEngine=models/mobilenetv2_qat.trt\n", - "[07/25/2022-16:46:43] [I] === Model Options ===\n", - "[07/25/2022-16:46:43] [I] Format: ONNX\n", - "[07/25/2022-16:46:43] [I] Model: models/mobilenetv2_qat.onnx\n", - "[07/25/2022-16:46:43] [I] Output:\n", - "[07/25/2022-16:46:43] [I] === Build Options ===\n", - "[07/25/2022-16:46:43] [I] Max batch: explicit batch\n", - "[07/25/2022-16:46:43] [I] Workspace: 16 MiB\n", - "[07/25/2022-16:46:43] [I] minTiming: 1\n", - "[07/25/2022-16:46:43] [I] avgTiming: 8\n", - "[07/25/2022-16:46:43] [I] Precision: FP32+INT8\n", - "[07/25/2022-16:46:43] [I] Calibration: Dynamic\n", - "[07/25/2022-16:46:43] [I] Refit: Disabled\n", - "[07/25/2022-16:46:43] [I] Sparsity: Disabled\n", - "[07/25/2022-16:46:43] [I] Safe mode: Disabled\n", - "[07/25/2022-16:46:43] [I] DirectIO mode: Disabled\n", - "[07/25/2022-16:46:43] [I] Restricted mode: Disabled\n", - "[07/25/2022-16:46:43] 
[I] Save engine: models/mobilenetv2_qat.trt\n", - "[07/25/2022-16:46:43] [I] Load engine: \n", - "[07/25/2022-16:46:43] [I] Profiling verbosity: 0\n", - "[07/25/2022-16:46:43] [I] Tactic sources: Using default tactic sources\n", - "[07/25/2022-16:46:43] [I] timingCacheMode: local\n", - "[07/25/2022-16:46:43] [I] timingCacheFile: \n", - "[07/25/2022-16:46:43] [I] Input(s)s format: fp32:CHW\n", - "[07/25/2022-16:46:43] [I] Output(s)s format: fp32:CHW\n", - "[07/25/2022-16:46:43] [I] Input build shapes: model\n", - "[07/25/2022-16:46:43] [I] Input calibration shapes: model\n", - "[07/25/2022-16:46:43] [I] === System Options ===\n", - "[07/25/2022-16:46:43] [I] Device: 0\n", - "[07/25/2022-16:46:43] [I] DLACore: \n", - "[07/25/2022-16:46:43] [I] Plugins:\n", - "[07/25/2022-16:46:43] [I] === Inference Options ===\n", - "[07/25/2022-16:46:43] [I] Batch: Explicit\n", - "[07/25/2022-16:46:43] [I] Input inference shapes: model\n", - "[07/25/2022-16:46:43] [I] Iterations: 10\n", - "[07/25/2022-16:46:43] [I] Duration: 3s (+ 200ms warm up)\n", - "[07/25/2022-16:46:43] [I] Sleep time: 0ms\n", - "[07/25/2022-16:46:43] [I] Idle time: 0ms\n", - "[07/25/2022-16:46:43] [I] Streams: 1\n", - "[07/25/2022-16:46:43] [I] ExposeDMA: Disabled\n", - "[07/25/2022-16:46:43] [I] Data transfers: Enabled\n", - "[07/25/2022-16:46:43] [I] Spin-wait: Disabled\n", - "[07/25/2022-16:46:43] [I] Multithreading: Disabled\n", - "[07/25/2022-16:46:43] [I] CUDA Graph: Disabled\n", - "[07/25/2022-16:46:43] [I] Separate profiling: Disabled\n", - "[07/25/2022-16:46:43] [I] Time Deserialize: Disabled\n", - "[07/25/2022-16:46:43] [I] Time Refit: Disabled\n", - "[07/25/2022-16:46:43] [I] Skip inference: Disabled\n", - "[07/25/2022-16:46:43] [I] Inputs:\n", - "[07/25/2022-16:46:43] [I] === Reporting Options ===\n", - "[07/25/2022-16:46:43] [I] Verbose: Disabled\n", - "[07/25/2022-16:46:43] [I] Averages: 10 inferences\n", - "[07/25/2022-16:46:43] [I] Percentile: 99\n", - "[07/25/2022-16:46:43] [I] Dump refittable 
layers:Disabled\n", - "[07/25/2022-16:46:43] [I] Dump output: Disabled\n", - "[07/25/2022-16:46:43] [I] Profile: Disabled\n", - "[07/25/2022-16:46:43] [I] Export timing to JSON file: \n", - "[07/25/2022-16:46:43] [I] Export output to JSON file: \n", - "[07/25/2022-16:46:43] [I] Export profile to JSON file: \n", - "[07/25/2022-16:46:43] [I] \n", - "[07/25/2022-16:46:43] [I] === Device Information ===\n", - "[07/25/2022-16:46:43] [I] Selected Device: NVIDIA Graphics Device\n", - "[07/25/2022-16:46:43] [I] Compute Capability: 8.0\n", - "[07/25/2022-16:46:43] [I] SMs: 124\n", - "[07/25/2022-16:46:43] [I] Compute Clock Rate: 1.005 GHz\n", - "[07/25/2022-16:46:43] [I] Device Global Memory: 47681 MiB\n", - "[07/25/2022-16:46:43] [I] Shared Memory per SM: 164 KiB\n", - "[07/25/2022-16:46:43] [I] Memory Bus Width: 6144 bits (ECC enabled)\n", - "[07/25/2022-16:46:43] [I] Memory Clock Rate: 1.215 GHz\n", - "[07/25/2022-16:46:43] [I] \n", - "[07/25/2022-16:46:43] [I] TensorRT version: 8.2.5\n", - "[07/25/2022-16:46:44] [I] [TRT] [MemUsageChange] Init CUDA: CPU +440, GPU +0, now: CPU 452, GPU 5862 (MiB)\n", - "[07/25/2022-16:46:44] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 452 MiB, GPU 5862 MiB\n", - "[07/25/2022-16:46:44] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 669 MiB, GPU 5934 MiB\n", - "[07/25/2022-16:46:44] [I] Start parsing network model\n", - "[07/25/2022-16:46:44] [I] [TRT] ----------------------------------------------------------------\n", - "[07/25/2022-16:46:44] [I] [TRT] Input filename: models/mobilenetv2_qat.onnx\n", - "[07/25/2022-16:46:44] [I] [TRT] ONNX IR version: 0.0.7\n", - "[07/25/2022-16:46:44] [I] [TRT] Opset version: 13\n", - "[07/25/2022-16:46:44] [I] [TRT] Producer name: pytorch\n", - "[07/25/2022-16:46:44] [I] [TRT] Producer version: 1.13.0\n", - "[07/25/2022-16:46:44] [I] [TRT] Domain: \n", - "[07/25/2022-16:46:44] [I] [TRT] Model version: 0\n", - "[07/25/2022-16:46:44] [I] [TRT] Doc 
string: \n", - "[07/25/2022-16:46:44] [I] [TRT] ----------------------------------------------------------------\n", - "[07/25/2022-16:46:44] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:506: Your ONNX model has been generated with double-typed weights, while TensorRT does not natively support double. Attempting to cast down to float.\n", - "[07/25/2022-16:46:44] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:368: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.\n", - "[07/25/2022-16:46:45] [I] Finish parsing network model\n", - "[07/25/2022-16:46:45] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best\n", - "[07/25/2022-16:46:45] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.\n", - "[07/25/2022-16:46:47] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +838, GPU +362, now: CPU 1543, GPU 6342 (MiB)\n", - "[07/25/2022-16:46:47] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +128, GPU +58, now: CPU 1671, GPU 6400 (MiB)\n", - "[07/25/2022-16:46:47] [I] [TRT] Local timing cache in use. 
Profiling results in this builder pass will not be stored.\n", - "[07/25/2022-16:47:09] [I] [TRT] Detected 1 inputs and 1 output network tensors.\n", - "[07/25/2022-16:47:09] [I] [TRT] Total Host Persistent Memory: 82480\n", - "[07/25/2022-16:47:09] [I] [TRT] Total Device Persistent Memory: 2413056\n", - "[07/25/2022-16:47:09] [I] [TRT] Total Scratch Memory: 0\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 11 MiB, GPU 184 MiB\n", - "[07/25/2022-16:47:09] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 3.32319ms to assign 4 blocks to 84 nodes requiring 130056192 bytes.\n", - "[07/25/2022-16:47:09] [I] [TRT] Total Activation Memory: 130056192\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1674, GPU 6412 (MiB)\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1674, GPU 6422 (MiB)\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +2, GPU +4, now: CPU 2, GPU 4 (MiB)\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1665, GPU 6384 (MiB)\n", - "[07/25/2022-16:47:09] [I] [TRT] Loaded engine size: 2 MiB\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1666, GPU 6398 (MiB)\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1666, GPU 6406 (MiB)\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2, now: CPU 0, GPU 2 (MiB)\n", - "[07/25/2022-16:47:09] [I] Engine built in 25.2523 sec.\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1435, GPU 6322 (MiB)\n", - "[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1436, GPU 6330 (MiB)\n", - 
"[07/25/2022-16:47:09] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +126, now: CPU 0, GPU 128 (MiB)\n", - "[07/25/2022-16:47:09] [I] Using random values for input inputs.1\n", - "[07/25/2022-16:47:09] [I] Created input binding for inputs.1 with dimensions 64x3x224x224\n", - "[07/25/2022-16:47:09] [I] Using random values for output 1225\n", - "[07/25/2022-16:47:09] [I] Created output binding for 1225 with dimensions 64x10\n", - "[07/25/2022-16:47:09] [I] Starting inference\n", - "[07/25/2022-16:47:12] [I] Warmup completed 63 queries over 200 ms\n", - "[07/25/2022-16:47:12] [I] Timing trace has 976 queries over 3.0073 s\n", - "[07/25/2022-16:47:12] [I] \n", - "[07/25/2022-16:47:12] [I] === Trace details ===\n", - "[07/25/2022-16:47:12] [I] Trace averages of 10 runs:\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.92225 ms - Host latency: 5.03344 ms (end to end 5.05219 ms, enqueue 1.40172 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.66963 ms - Host latency: 4.78574 ms (end to end 4.80028 ms, enqueue 1.39754 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61669 ms - Host latency: 4.73438 ms (end to end 4.75002 ms, enqueue 1.40104 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59776 ms - Host latency: 4.70923 ms (end to end 4.72325 ms, enqueue 1.40551 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59918 ms - Host latency: 4.715 ms (end to end 4.72859 ms, enqueue 1.39258 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59212 ms - Host latency: 4.70311 ms (end to end 4.71815 ms, enqueue 1.40127 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59171 ms - Host latency: 4.70111 ms (end to end 4.71709 ms, enqueue 1.3924 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.58884 ms - Host latency: 4.69999 ms (end to end 4.71507 ms, 
enqueue 1.38793 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59119 ms - Host latency: 4.70641 ms (end to end 4.72385 ms, enqueue 1.39411 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.58546 ms - Host latency: 4.70263 ms (end to end 4.7179 ms, enqueue 1.39454 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.58618 ms - Host latency: 4.69799 ms (end to end 4.71401 ms, enqueue 1.38189 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59365 ms - Host latency: 4.70694 ms (end to end 4.72247 ms, enqueue 1.40284 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59426 ms - Host latency: 4.70533 ms (end to end 4.71981 ms, enqueue 1.40167 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59406 ms - Host latency: 4.70507 ms (end to end 4.72038 ms, enqueue 1.39868 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59302 ms - Host latency: 4.70604 ms (end to end 4.72096 ms, enqueue 1.39022 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.5956 ms - Host latency: 4.70856 ms (end to end 4.72499 ms, enqueue 1.39016 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59622 ms - Host latency: 4.71029 ms (end to end 4.72501 ms, enqueue 1.39351 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59784 ms - Host latency: 4.70826 ms (end to end 4.72278 ms, enqueue 1.39263 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59805 ms - Host latency: 4.71088 ms (end to end 4.72592 ms, enqueue 1.39367 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59795 ms - Host latency: 4.71144 ms (end to end 4.72837 ms, enqueue 1.3975 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59713 ms - Host latency: 4.70555 ms (end to end 4.72311 ms, enqueue 1.40206 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 
10 runs - GPU latency: 1.59601 ms - Host latency: 4.68881 ms (end to end 4.70304 ms, enqueue 1.3645 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.58742 ms - Host latency: 4.69799 ms (end to end 4.71174 ms, enqueue 1.39108 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59344 ms - Host latency: 4.70665 ms (end to end 4.72214 ms, enqueue 1.39278 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59734 ms - Host latency: 4.70482 ms (end to end 4.71854 ms, enqueue 1.39332 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59714 ms - Host latency: 4.70997 ms (end to end 4.72628 ms, enqueue 1.40047 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61176 ms - Host latency: 4.72535 ms (end to end 4.7418 ms, enqueue 1.39706 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61494 ms - Host latency: 4.72816 ms (end to end 4.7448 ms, enqueue 1.39434 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61383 ms - Host latency: 4.72913 ms (end to end 4.7439 ms, enqueue 1.40642 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61697 ms - Host latency: 4.73928 ms (end to end 4.75625 ms, enqueue 1.41578 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61782 ms - Host latency: 4.83635 ms (end to end 4.85382 ms, enqueue 0.316187 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61688 ms - Host latency: 4.81012 ms (end to end 4.82694 ms, enqueue 0.524707 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62682 ms - Host latency: 4.69824 ms (end to end 4.71261 ms, enqueue 1.44248 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62582 ms - Host latency: 4.68247 ms (end to end 4.69834 ms, enqueue 1.57075 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62538 ms - Host latency: 4.68074 ms 
(end to end 4.69913 ms, enqueue 1.56764 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62548 ms - Host latency: 4.68276 ms (end to end 4.69795 ms, enqueue 1.58025 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62765 ms - Host latency: 4.68287 ms (end to end 4.70229 ms, enqueue 1.56355 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62581 ms - Host latency: 4.68279 ms (end to end 4.69857 ms, enqueue 1.57596 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62439 ms - Host latency: 4.68186 ms (end to end 4.69902 ms, enqueue 1.56841 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62468 ms - Host latency: 4.6818 ms (end to end 4.69666 ms, enqueue 1.57666 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62562 ms - Host latency: 4.68257 ms (end to end 4.6985 ms, enqueue 1.57379 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61575 ms - Host latency: 4.67201 ms (end to end 4.68948 ms, enqueue 1.58751 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61467 ms - Host latency: 4.67125 ms (end to end 4.68734 ms, enqueue 1.57214 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.6139 ms - Host latency: 4.66783 ms (end to end 4.6828 ms, enqueue 1.56377 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61342 ms - Host latency: 4.67017 ms (end to end 4.68673 ms, enqueue 1.57308 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61005 ms - Host latency: 4.66664 ms (end to end 4.68411 ms, enqueue 1.55513 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.59465 ms - Host latency: 4.65076 ms (end to end 4.66672 ms, enqueue 1.56719 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.5959 ms - Host latency: 4.65466 ms (end to end 4.66882 ms, enqueue 1.5709 ms)\n", - 
"[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.60471 ms - Host latency: 4.66272 ms (end to end 4.68046 ms, enqueue 1.58149 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61157 ms - Host latency: 4.66888 ms (end to end 4.68478 ms, enqueue 1.62261 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61403 ms - Host latency: 4.66865 ms (end to end 4.68436 ms, enqueue 1.61089 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61339 ms - Host latency: 4.66898 ms (end to end 4.6855 ms, enqueue 1.59581 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61229 ms - Host latency: 4.66919 ms (end to end 4.68688 ms, enqueue 1.57114 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61361 ms - Host latency: 4.67148 ms (end to end 4.68864 ms, enqueue 1.57201 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61329 ms - Host latency: 4.66671 ms (end to end 4.6823 ms, enqueue 1.56505 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61117 ms - Host latency: 4.66793 ms (end to end 4.68323 ms, enqueue 1.58344 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61322 ms - Host latency: 4.67312 ms (end to end 4.68901 ms, enqueue 1.57474 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61351 ms - Host latency: 4.6689 ms (end to end 4.68566 ms, enqueue 1.57411 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.6125 ms - Host latency: 4.67083 ms (end to end 4.68839 ms, enqueue 1.56761 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61216 ms - Host latency: 4.66829 ms (end to end 4.68427 ms, enqueue 1.57145 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61221 ms - Host latency: 4.66812 ms (end to end 4.68464 ms, enqueue 1.57742 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 
1.61724 ms - Host latency: 4.67236 ms (end to end 4.69009 ms, enqueue 1.58645 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.6342 ms - Host latency: 4.69334 ms (end to end 4.70886 ms, enqueue 1.58391 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.64475 ms - Host latency: 4.70205 ms (end to end 4.71633 ms, enqueue 1.57148 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.64463 ms - Host latency: 4.70203 ms (end to end 4.71699 ms, enqueue 1.56494 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.64092 ms - Host latency: 4.69741 ms (end to end 4.71147 ms, enqueue 1.57456 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62642 ms - Host latency: 4.68474 ms (end to end 4.70034 ms, enqueue 1.56938 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62737 ms - Host latency: 4.68528 ms (end to end 4.70254 ms, enqueue 1.57288 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62422 ms - Host latency: 4.68096 ms (end to end 4.69629 ms, enqueue 1.58088 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62236 ms - Host latency: 4.67939 ms (end to end 4.69592 ms, enqueue 1.56531 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61946 ms - Host latency: 4.67705 ms (end to end 4.69207 ms, enqueue 1.57915 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62383 ms - Host latency: 4.68113 ms (end to end 4.69565 ms, enqueue 1.56628 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62493 ms - Host latency: 4.68076 ms (end to end 4.69827 ms, enqueue 1.57712 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62881 ms - Host latency: 4.68533 ms (end to end 4.70332 ms, enqueue 1.59106 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62705 ms - Host latency: 4.77595 ms (end to end 4.79063 ms, 
enqueue 1.23335 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.63042 ms - Host latency: 4.83225 ms (end to end 4.84863 ms, enqueue 0.584692 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62356 ms - Host latency: 4.80049 ms (end to end 4.81941 ms, enqueue 0.722852 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61289 ms - Host latency: 4.70488 ms (end to end 4.72126 ms, enqueue 1.16353 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61414 ms - Host latency: 4.67012 ms (end to end 4.6865 ms, enqueue 1.55625 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61272 ms - Host latency: 4.66924 ms (end to end 4.68572 ms, enqueue 1.57039 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61147 ms - Host latency: 4.66743 ms (end to end 4.6821 ms, enqueue 1.57139 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61204 ms - Host latency: 4.66624 ms (end to end 4.68369 ms, enqueue 1.57068 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61245 ms - Host latency: 4.67002 ms (end to end 4.68525 ms, enqueue 1.56729 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61497 ms - Host latency: 4.67256 ms (end to end 4.68835 ms, enqueue 1.5822 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61396 ms - Host latency: 4.6707 ms (end to end 4.6873 ms, enqueue 1.56724 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61487 ms - Host latency: 4.67173 ms (end to end 4.68682 ms, enqueue 1.57334 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61299 ms - Host latency: 4.66936 ms (end to end 4.68381 ms, enqueue 1.57117 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61013 ms - Host latency: 4.66755 ms (end to end 4.68381 ms, enqueue 1.57551 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 
10 runs - GPU latency: 1.61992 ms - Host latency: 4.67517 ms (end to end 4.69097 ms, enqueue 1.58848 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62769 ms - Host latency: 4.6877 ms (end to end 4.70227 ms, enqueue 1.57029 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62732 ms - Host latency: 4.68355 ms (end to end 4.70088 ms, enqueue 1.56836 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.62974 ms - Host latency: 4.6852 ms (end to end 4.69971 ms, enqueue 1.56511 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61794 ms - Host latency: 4.67524 ms (end to end 4.68911 ms, enqueue 1.57212 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.61436 ms - Host latency: 4.6708 ms (end to end 4.68591 ms, enqueue 1.5667 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.60833 ms - Host latency: 4.66543 ms (end to end 4.68132 ms, enqueue 1.57961 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.6125 ms - Host latency: 4.66885 ms (end to end 4.68545 ms, enqueue 1.56494 ms)\n", - "[07/25/2022-16:47:12] [I] Average on 10 runs - GPU latency: 1.63328 ms - Host latency: 4.69219 ms (end to end 4.70671 ms, enqueue 1.57573 ms)\n", - "[07/25/2022-16:47:12] [I] \n", - "[07/25/2022-16:47:12] [I] === Performance summary ===\n", - "[07/25/2022-16:47:12] [I] Throughput: 324.544 qps\n", - "[07/25/2022-16:47:12] [I] Latency: min = 4.63513 ms, max = 5.62218 ms, mean = 4.69772 ms, median = 4.68481 ms, percentile(99%) = 4.86353 ms\n", - "[07/25/2022-16:47:12] [I] End-to-End Host Latency: min = 4.64392 ms, max = 5.64146 ms, mean = 4.71364 ms, median = 4.70197 ms, percentile(99%) = 4.88013 ms\n", - "[07/25/2022-16:47:12] [I] Enqueue Time: min = 0.310181 ms, max = 4.23633 ms, mean = 1.46804 ms, median = 1.5567 ms, percentile(99%) = 1.67847 ms\n", - "[07/25/2022-16:47:12] [I] H2D Latency: min = 3.01538 ms, max = 3.23657 ms, mean = 
3.06713 ms, median = 3.05371 ms, percentile(99%) = 3.20923 ms\n", - "[07/25/2022-16:47:12] [I] GPU Compute Time: min = 1.578 ms, max = 2.49139 ms, mean = 1.61667 ms, median = 1.61377 ms, percentile(99%) = 1.69678 ms\n", - "[07/25/2022-16:47:12] [I] D2H Latency: min = 0.00561523 ms, max = 0.0319824 ms, mean = 0.0139259 ms, median = 0.0134277 ms, percentile(99%) = 0.0289307 ms\n", - "[07/25/2022-16:47:12] [I] Total Host Walltime: 3.0073 s\n", - "[07/25/2022-16:47:12] [I] Total GPU Compute Time: 1.57787 s\n", - "[07/25/2022-16:47:12] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.\n", - "[07/25/2022-16:47:12] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.\n", - "[07/25/2022-16:47:12] [W] * Throughput may be bound by host-to-device transfers for the inputs rather than GPU Compute and the GPU may be under-utilized.\n", - "[07/25/2022-16:47:12] [W] Add --noDataTransfers flag to disable data transfers.\n", - "[07/25/2022-16:47:12] [I] Explanations of the performance metrics are printed in the verbose logs.\n", - "[07/25/2022-16:47:12] [I] \n", - "&&&& PASSED TensorRT.trtexec [TensorRT v8205] # trtexec --onnx=models/mobilenetv2_qat.onnx --int8 --saveEngine=models/mobilenetv2_qat.trt\n" - ] - } - ], - "source": [ - "# Set static member of TensorQuantizer to use Pytorch’s own fake quantization functions\n", - "quant_nn.TensorQuantizer.use_fb_fake_quant = True\n", - "\n", - "# Exporting to ONNX\n", - "dummy_input = torch.randn(64, 3, 224, 224, device='cuda')\n", - "input_names = [ \"actual_input_1\" ]\n", - "output_names = [ \"output1\" ]\n", - "torch.onnx.export(\n", - " q_model,\n", - " dummy_input,\n", - " \"models/mobilenetv2_qat.onnx\",\n", - " verbose=False,\n", - " opset_version=13,\n", - " do_constant_folding = False)\n", - "\n", - "# Converting ONNX model to TRT\n", - "!trtexec --onnx=models/mobilenetv2_qat.onnx --int8 
--saveEngine=models/mobilenetv2_qat.trt" - ] - }, - { - "cell_type": "markdown", - "id": "b5108ef4", - "metadata": {}, - "source": [ - "\n", - "### 6. Evaluation and Benchmarking" - ] - }, - { - "cell_type": "markdown", - "id": "2e5362ca", - "metadata": {}, - "source": [ - "We have now converted our model to a TensorRT engine. That means we are ready to load it into the native Python TensorRT runtime to perform inference and evaluate our models." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "790d73a6", - "metadata": {}, - "outputs": [], - "source": [ - "# Import needed libraries and define the evaluate function\n", - "\n", - "import numpy as np\n", - "import tensorrt as trt\n", - "import pycuda.driver as cuda\n", - "import pycuda.autoinit\n", - "import time\n", - "\n", - "def evaluate_trt(engine_path, dataloader, batch_size):\n", - "\n", - "    def predict(batch): # result gets copied into output\n", - "        # transfer input data to device\n", - "        cuda.memcpy_htod_async(d_input, batch, stream)\n", - "        # execute model\n", - "        context.execute_async_v2(bindings, stream.handle, None)\n", - "        # transfer predictions back\n", - "        cuda.memcpy_dtoh_async(output, d_output, stream)\n", - "        # synchronize the stream\n", - "        stream.synchronize()\n", - "        return output\n", - "\n", - "    with open(engine_path, 'rb') as f, trt.Runtime(trt.Logger(trt.Logger.WARNING)) as runtime, runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:\n", - "        total = 0\n", - "        correct = 0\n", - "        # Iterate over the dataloader passed in as an argument\n", - "        for images, labels in dataloader:\n", - "            input_batch = images.numpy()\n", - "            labels = labels.numpy()\n", - "            output = np.empty([batch_size, 10], dtype = np.float32)\n", - "\n", - "            # Now allocate input and output memory, give TRT pointers (bindings) to it:\n", - "            d_input = cuda.mem_alloc(1 * input_batch.nbytes)\n", - "            d_output = cuda.mem_alloc(1 * output.nbytes)\n", - "            bindings = [int(d_input), int(d_output)]\n", - "\n", - "            stream = cuda.Stream()\n", - "            preds = predict(input_batch)\n", - "            pred_labels = []\n", - "            for pred in preds:\n", - "                # index of the highest-scoring class\n", - "                pred_label = (-pred).argsort()[0]\n", - "                pred_labels.append(pred_label)\n", - "\n", - "            total += len(labels)\n", - "            correct += (pred_labels == labels).sum()\n", - "\n", - "    return correct/total" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "f3fd416f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Mobilenetv2 TRT Baseline accuracy: 71.13%\n" - ] - } - ], - "source": [ - "# Evaluate and benchmark the performance of the baseline TRT model (TRT FP32 Model)\n", - "batch_size = 64\n", - "test_acc = evaluate_trt(\"models/mobilenetv2_base.trt\", val_dataloader, batch_size)\n", - "print(\"Mobilenetv2 TRT Baseline accuracy: {:.2f}%\".format(100 * test_acc))" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "a5ec3a81", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Mobilenetv2 TRT PTQ accuracy: 68.11%\n" - ] - } - ], - "source": [ - "# Evaluate the PTQ model\n", - "batch_size = 64\n", - "test_acc = evaluate_trt(\"models/mobilenetv2_ptq.trt\", val_dataloader, batch_size)\n", - "print(\"Mobilenetv2 TRT PTQ accuracy: {:.2f}%\".format(100 * test_acc))" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "eb95977d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Mobilenetv2 TRT QAT accuracy: 70.31%\n" - ] - } - ], - "source": [ - "# Evaluate the QAT model\n", - "batch_size = 64\n", - "test_acc = evaluate_trt(\"models/mobilenetv2_qat.trt\", val_dataloader, batch_size)\n", - "print(\"Mobilenetv2 TRT QAT accuracy: {:.2f}%\".format(100 * test_acc))" - ] - }, - { - "cell_type": "markdown", - "id": "20c82807", - "metadata": {}, - "source": [ - "Compared to the TRT FP32 model, the QAT INT8 engine is ~3.7x faster (5.95 ms vs. 1.61 ms) with only a ~0.8% loss in accuracy (71.13% vs. 70.31%).
" - ] - }, - { - "cell_type": "markdown", - "id": "52f311fb", - "metadata": {}, - "source": [ - "\n", - "## 7. Conclusion\n", - "We put together all the observations that were made in this notebook. Note that, these numbers can vary with every run due to the stochastic nature of the training process, but a similar pattern can still be noticed.\n", - "\n", - "| Model | Accuracy | Performance |\n", - "| ------------------------ | -------- | ----------- |\n", - "| Baseline MobileNetv2 | 71.11% | 11.92ms |\n", - "| Base + TRT
(TRT FP32) | 71.13% | 5.95ms |\n", - "| PTQ + TRT
(TRT int8) | 68.11% | 1.59ms |\n", - "| QAT+TRT
(TRT INT8) | 70.31% | 1.61ms |" - ] - }, - { - "cell_type": "markdown", - "id": "91dfc2c1", - "metadata": {}, - "source": [ - "\n", - "## 8. References\n", - "* Very Deep Convolution Networks for large scale Image Recognition\n", - "* Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT\n", - "* Pytorch-quantization toolkit from NVIDIA\n", - "* Pytorch quantization toolkit userguide\n", - "* Quantization basics\n", - "* TensorRT Developer Guide" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.13" - }, - "vscode": { - "interpreter": { - "hash": "b8290132a159428f0004735847c0b4016c8a5153e62fd80cc71ad5cd485f05b0" - } - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}