diff --git a/examples/bert/bert_auto_opt_gpu.json b/examples/bert/notebook/bert_auto_opt_gpu.json
similarity index 95%
rename from examples/bert/bert_auto_opt_gpu.json
rename to examples/bert/notebook/bert_auto_opt_gpu.json
index a2cb6468c..509735076 100644
--- a/examples/bert/bert_auto_opt_gpu.json
+++ b/examples/bert/notebook/bert_auto_opt_gpu.json
@@ -69,9 +69,10 @@
         "host": "local_system",
         "target": "local_system",
         "cache_dir": "cache",
+        "packaging_config": {
+            "type": "Zipfile",
+            "name": "bert"
+        },
         "output_dir": "models/bert_gpu"
-    },
-    "auto_optimizer_config": {
-        "precisions": ["fp16"]
     }
 }
diff --git a/examples/bert/notebook/multi_ep_search.ipynb b/examples/bert/notebook/multi_ep_search.ipynb
new file mode 100644
index 000000000..fde3df323
--- /dev/null
+++ b/examples/bert/notebook/multi_ep_search.ipynb
@@ -0,0 +1,400 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# How to leverage Olive to search for the optimal optimization among different EPs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In many cases, users are not familiar with the different EPs (execution providers) and their capabilities.\n",
+    "\n",
+    "For example:\n",
+    "1. With CUDAExecutionProvider, `trt_fp16_enable` cannot be enabled in `PerfTuning`, whereas with TensorrtExecutionProvider it is recommended to enable `trt_fp16_enable` in `PerfTuning`.\n",
+    "2. With CUDAExecutionProvider, INT8 quantization is not recommended in ONNX Runtime.\n",
+    "3. With CUDAExecutionProvider, `opt_level=2` is sometimes better for model A while `opt_level=1` is better for model B.\n",
+    "4. ...\n",
+    "\n",
+    "In this notebook, we show how to use Olive to search for the optimal optimization across different EPs, given a model, the evaluation criteria, and the target systems and EPs."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Prerequisites\n",
+    "Before running this notebook, please make sure you have installed the Olive package. Refer to [the installation instructions](https://github.com/microsoft/Olive?tab=readme-ov-file#installation) for more details."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Olive Optimization Configs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Input model\n",
+    "In this notebook, we use the simple `Intel/bert-base-uncased-mrpc` model as an example:\n",
+    "\n",
+    "```json\n",
+    "\"input_model\":{\n",
+    "    \"type\": \"PyTorchModel\",\n",
+    "    \"config\": {\n",
+    "        \"hf_config\": {\n",
+    "            \"model_name\": \"Intel/bert-base-uncased-mrpc\",\n",
+    "            \"task\": \"text-classification\",\n",
+    "            \"dataset\": {\n",
+    "                \"data_name\":\"glue\",\n",
+    "                \"subset\": \"mrpc\",\n",
+    "                \"split\": \"validation\",\n",
+    "                \"input_cols\": [\"sentence1\", \"sentence2\"],\n",
+    "                \"label_cols\": [\"label\"],\n",
+    "                \"batch_size\": 1\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "With the above input model config, Olive downloads the `bert-base-uncased-mrpc` model from the Hugging Face model hub together with the MRPC subset of the GLUE dataset. The model is a text-classification model: the input is a pair of sentences and the output is a label.\n",
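+    "\n",
+    "As a quick sanity check (not required for the Olive run), the following minimal sketch, which assumes the `transformers` and `datasets` packages are installed, loads the same model and dataset the config points to:\n",
+    "\n",
+    "```python\n",
+    "from datasets import load_dataset\n",
+    "from transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
+    "\n",
+    "# The same model and dataset that `hf_config` above refers to\n",
+    "model = AutoModelForSequenceClassification.from_pretrained(\"Intel/bert-base-uncased-mrpc\")\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"Intel/bert-base-uncased-mrpc\")\n",
+    "dataset = load_dataset(\"glue\", \"mrpc\", split=\"validation\")\n",
+    "\n",
+    "# One (sentence1, sentence2) pair in, one predicted label out\n",
+    "sample = dataset[0]\n",
+    "inputs = tokenizer(sample[\"sentence1\"], sample[\"sentence2\"], return_tensors=\"pt\")\n",
+    "print(model(**inputs).logits.argmax(-1).item(), sample[\"label\"])\n",
+    "```\n",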
+    "\n",
+    "\n",
+    "#### Evaluation Criteria\n",
+    "```json\n",
+    "\"evaluators\": {\n",
+    "    \"common_evaluator\": {\n",
+    "        \"metrics\":[\n",
+    "            {\n",
+    "                \"name\": \"accuracy\",\n",
+    "                \"type\": \"accuracy\",\n",
+    "                \"backend\": \"huggingface_metrics\",\n",
+    "                \"sub_types\": [\n",
+    "                    {\"name\": \"accuracy\", \"priority\": 1, \"goal\": {\"type\": \"max-degradation\", \"value\": 0.01}},\n",
+    "                    {\"name\": \"f1\"}\n",
+    "                ]\n",
+    "            },\n",
+    "            {\n",
+    "                \"name\": \"latency\",\n",
+    "                \"type\": \"latency\",\n",
+    "                \"sub_types\": [\n",
+    "                    {\"name\": \"avg\", \"priority\": 2, \"goal\": {\"type\": \"percent-min-improvement\", \"value\": 20}},\n",
+    "                    {\"name\": \"max\"},\n",
+    "                    {\"name\": \"min\"}\n",
+    "                ]\n",
+    "            }\n",
+    "        ]\n",
+    "    }\n",
+    "}\n",
+    "```\n",
+    "We use `accuracy` and `latency` as the evaluation criteria. For `accuracy`, the sub-metrics are `accuracy` and `f1`; for `latency`, they are `avg`, `max`, and `min`. Note that the two metrics have opposite goals: we want to maximize `accuracy` and `f1` (here allowing at most 0.01 accuracy degradation), and minimize the `avg`, `max`, and `min` latency (here requiring at least a 20% improvement in `avg` latency).\n",
+    "\n",
+    "\n",
+    "#### Devices\n",
+    "\n",
+    "We use `local_system` as the device in this notebook and enable `CUDAExecutionProvider` and `TensorrtExecutionProvider` in the `accelerators` field. Olive will search for optimization configs across these two EPs.\n",
+    "\n",
+    "```json\n",
+    "\"systems\": {\n",
+    "    \"local_system\": {\n",
+    "        \"type\": \"LocalSystem\",\n",
+    "        \"config\": {\n",
+    "            \"accelerators\": [\n",
+    "                {\n",
+    "                    \"device\": \"gpu\",\n",
+    "                    \"execution_providers\": [\n",
+    "                        \"CUDAExecutionProvider\",\n",
+    "                        \"TensorrtExecutionProvider\"\n",
+    "                    ]\n",
+    "                }\n",
+    "            ]\n",
+    "        }\n",
+    "    }\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "#### Engine and search strategy\n",
+    "\n",
+    "The engine manages the optimization process: optimization passes run on the host device and evaluation runs on the target device.\n",
+    "The search strategy controls how Olive explores the optimizations across the different EPs. In this notebook, we use `joint` as the `execution_order` and `tpe` (Tree-structured Parzen Estimator) as the `search_algorithm`, with `num_samples` set to 1 and `seed` set to 0. We also set `packaging_config` so that the best model is packaged as a Zipfile named `bert`.\n",
+    "\n",
+    "```json\n",
+    "\"engine\": {\n",
+    "    \"search_strategy\": {\n",
+    "        \"execution_order\": \"joint\",\n",
+    "        \"search_algorithm\": \"tpe\",\n",
+    "        \"search_algorithm_config\": {\n",
+    "            \"num_samples\": 1,\n",
+    "            \"seed\": 0\n",
+    "        }\n",
+    "    },\n",
+    "    \"evaluator\": \"common_evaluator\",\n",
+    "    \"host\": \"local_system\",\n",
+    "    \"target\": \"local_system\",\n",
+    "    \"cache_dir\": \"cache\",\n",
+    "    \"packaging_config\": {\n",
+    "        \"type\": \"Zipfile\",\n",
+    "        \"name\": \"bert\"\n",
+    "    },\n",
+    "    \"output_dir\": \"models/bert_gpu\"\n",
+    "}\n",
+    "```"
+   ]
+  },
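+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Besides the CLI used below, the same workflow can be launched from Python. The following is a minimal sketch; it assumes that the `olive.workflows.run` entry point (the module behind the `python -m olive.workflows.run` command in the next section) also accepts an in-memory config dict:\n",
+    "\n",
+    "```python\n",
+    "import json\n",
+    "\n",
+    "from olive.workflows import run as olive_run\n",
+    "\n",
+    "# Load the same config that the CLI command below consumes\n",
+    "with open(\"bert_auto_opt_gpu.json\") as f:\n",
+    "    config = json.load(f)\n",
+    "\n",
+    "# The config can be tweaked programmatically before running,\n",
+    "# e.g. the number of TPE samples (kept at 1 here, matching the JSON)\n",
+    "config[\"engine\"][\"search_strategy\"][\"search_algorithm_config\"][\"num_samples\"] = 1\n",
+    "\n",
+    "footprints = olive_run(config)\n",
+    "```"
+   ]
+  },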
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Start Optimization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "metadata": {}
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[2024-04-18 17:28:05,856] [INFO] [run.py:261:run] Loading Olive module configuration from: /home/dummy_user/venv/lib/python3.8/site-packages/olive/olive_config.json\n",
+      "[2024-04-18 17:28:05,857] [INFO] [run.py:267:run] Loading run configuration from: bert_auto_opt_gpu.json\n",
+      "2024-04-18 17:28:06.856491: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
+      "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
+      "[2024-04-18 17:28:16,070] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-cuda,gpu-tensorrt\n",
+      "[2024-04-18 17:28:16,138] [INFO] [engine.py:106:initialize] Using cache directory: cache\n",
+      "[2024-04-18 17:28:16,139] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-cuda\n",
+      "[2024-04-18 17:28:16,263] [INFO] [engine.py:324:run_accelerator] Input model evaluation results: {\n",
+      " \"accuracy-accuracy\": 0.8602941176470589,\n",
+      " \"accuracy-f1\": 0.9042016806722689,\n",
+      " \"latency-avg\": 12.60843,\n",
+      " \"latency-max\": 12.95492,\n",
+      " \"latency-min\": 12.46612\n",
+      "}\n",
+      "[2024-04-18 17:28:16,264] [INFO] [engine.py:329:run_accelerator] Saved evaluation results of input model to models/bert_gpu/gpu-cuda_input_model_metrics.json\n",
+      "/home/dummy_user/venv/lib/python3.8/site-packages/optuna/samplers/_tpe/sampler.py:281: ExperimentalWarning: ``multivariate`` option is an experimental feature. The interface can change in the future.\n",
+      "  warnings.warn(\n",
+      "/home/dummy_user/venv/lib/python3.8/site-packages/optuna/samplers/_tpe/sampler.py:292: ExperimentalWarning: ``group`` option is an experimental feature. 
The interface can change in the future.\n", + " warnings.warn(\n", + "[2024-04-18 17:28:16,268] [INFO] [engine.py:864:_run_pass] Running pass OnnxConversion:OnnxConversion\n", + "[2024-04-18 17:28:16,270] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c from cache/runs\n", + "[2024-04-18 17:28:16,270] [INFO] [engine.py:864:_run_pass] Running pass OrtTransformersOptimization:OrtTransformersOptimization\n", + "[2024-04-18 17:28:16,272] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 1_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-cuda from cache/runs\n", + "[2024-04-18 17:28:16,272] [INFO] [engine.py:864:_run_pass] Running pass OrtPerfTuning:OrtPerfTuning\n", + "[2024-04-18 17:28:16,275] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 2_OrtPerfTuning-1-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda from cache/runs\n", + "[2024-04-18 17:28:16,275] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...\n", + "[2024-04-18 17:28:16,279] [INFO] [engine.py:864:_run_pass] Running pass OnnxConversion:OnnxConversion\n", + "[2024-04-18 17:28:16,280] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c from cache/runs\n", + "[2024-04-18 17:28:16,280] [INFO] [engine.py:864:_run_pass] Running pass OrtTransformerOptimization_cuda_fp16:OrtTransformersOptimization\n", + "[2024-04-18 17:28:16,282] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 3_OrtTransformersOptimization-0-de339c0fa2131b21f568f4173abc200b-gpu-cuda from cache/runs\n", + "[2024-04-18 17:28:16,283] [INFO] [engine.py:864:_run_pass] Running pass OrtPerfTuning:OrtPerfTuning\n", + "[2024-04-18 17:28:16,286] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 4_OrtPerfTuning-3-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda from cache/runs\n", + "[2024-04-18 17:28:16,286] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...\n", + "[2024-04-18 17:28:16,289] [INFO] [engine.py:864:_run_pass] Running pass OnnxConversion:OnnxConversion\n", + "[2024-04-18 17:28:16,291] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c from cache/runs\n", + "[2024-04-18 17:28:16,291] [INFO] [engine.py:864:_run_pass] Running pass OrtTransformersOptimization:OrtTransformersOptimization\n", + "[2024-04-18 17:28:16,293] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 1_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-cuda from cache/runs\n", + "[2024-04-18 17:28:16,293] [INFO] [engine.py:864:_run_pass] Running pass OrtMixedPrecision:OrtMixedPrecision\n", + "[2024-04-18 17:28:16,294] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 5_OrtMixedPrecision-1-1e525c46c7f3e5c3f337c2ac8c50a03d from cache/runs\n", + "[2024-04-18 17:28:16,295] [INFO] [engine.py:864:_run_pass] Running pass OrtPerfTuning:OrtPerfTuning\n", + "[2024-04-18 17:28:16,298] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 6_OrtPerfTuning-5-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda from cache/runs\n", + "[2024-04-18 17:28:16,298] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...\n", + "[2024-04-18 17:28:16,299] [INFO] [footprint.py:101:create_pareto_frontier] Output all 8 models\n", + "[2024-04-18 17:28:16,300] [INFO] [footprint.py:120:_create_pareto_frontier_from_nodes] pareto 
frontier points: 4_OrtPerfTuning-3-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda \n", + "{\n", + " \"accuracy-accuracy\": 0.8602941176470589,\n", + " \"accuracy-f1\": 0.9042016806722689,\n", + " \"latency-avg\": 1.33635,\n", + " \"latency-max\": 1.34257,\n", + " \"latency-min\": 1.33018\n", + "}\n", + "[2024-04-18 17:28:16,302] [INFO] [engine.py:361:run_accelerator] Save footprint to models/bert_gpu/gpu-cuda_footprints.json.\n", + "[2024-04-18 17:28:16,307] [INFO] [engine.py:279:run] Run history for gpu-cuda:\n", + "[2024-04-18 17:28:16,317] [INFO] [engine.py:567:dump_run_history] run history:\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| model_id | parent_model_id | from_pass | duration_sec | metrics |\n", + "+====================================================================================+====================================================================================+=============================+================+============================================+\n", + "| 53fc6781998a4624b61959bb064622ce | | | | { |\n", + "| | | | | \"accuracy-accuracy\": 0.8602941176470589, |\n", + "| | | | | \"accuracy-f1\": 0.9042016806722689, |\n", + "| | | | | \"latency-avg\": 12.60843, |\n", + "| | | | | \"latency-max\": 12.95492, |\n", + "| | | | | \"latency-min\": 12.46612 |\n", + "| | | | | } |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c | 53fc6781998a4624b61959bb064622ce | OnnxConversion | 11.6793 | |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 1_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-cuda | 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c | OrtTransformersOptimization | 7.86033 | |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 2_OrtPerfTuning-1-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda | 1_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-cuda | OrtPerfTuning | 28.6088 | { |\n", + "| | | | | \"accuracy-accuracy\": 0.8602941176470589, |\n", + "| | | | | \"accuracy-f1\": 0.9042016806722689, |\n", + "| | | | | \"latency-avg\": 2.29684, |\n", + "| | | | | \"latency-max\": 2.30272, |\n", + "| | | | | \"latency-min\": 2.29272 |\n", + "| | | | | } |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 3_OrtTransformersOptimization-0-de339c0fa2131b21f568f4173abc200b-gpu-cuda | 
0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c | OrtTransformersOptimization | 9.70354 | |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 4_OrtPerfTuning-3-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda | 3_OrtTransformersOptimization-0-de339c0fa2131b21f568f4173abc200b-gpu-cuda | OrtPerfTuning | 25.5901 | { |\n", + "| | | | | \"accuracy-accuracy\": 0.8602941176470589, |\n", + "| | | | | \"accuracy-f1\": 0.9042016806722689, |\n", + "| | | | | \"latency-avg\": 1.33635, |\n", + "| | | | | \"latency-max\": 1.34257, |\n", + "| | | | | \"latency-min\": 1.33018 |\n", + "| | | | | } |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 5_OrtMixedPrecision-1-1e525c46c7f3e5c3f337c2ac8c50a03d | 1_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-cuda | OrtMixedPrecision | 6.87652 | |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 6_OrtPerfTuning-5-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda | 5_OrtMixedPrecision-1-1e525c46c7f3e5c3f337c2ac8c50a03d | OrtPerfTuning | 31.2253 | { |\n", + "| | | | | \"accuracy-accuracy\": 0.8602941176470589, |\n", + "| | | | | \"accuracy-f1\": 0.9042016806722689, |\n", + "| | | | | \"latency-avg\": 1.33801, |\n", + "| | | | | \"latency-max\": 1.34434, |\n", + "| | | | | \"latency-min\": 1.33169 |\n", + "| | | | | } |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "[2024-04-18 17:28:16,317] [INFO] [engine.py:285:run] Package top ranked 1 models as artifacts\n", + "[2024-04-18 17:28:16,318] [INFO] [packaging_generator.py:67:_package_candidate_models] Packaging output models to PackagingType.Zipfile\n", + "[2024-04-18 17:28:42,335] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-tensorrt\n", + "[2024-04-18 17:28:42,432] [INFO] [engine.py:324:run_accelerator] Input model evaluation results: {\n", + " \"accuracy-accuracy\": 0.8602941176470589,\n", + " \"accuracy-f1\": 0.9042016806722689,\n", + " \"latency-avg\": 12.52778,\n", + " \"latency-max\": 12.80721,\n", + " \"latency-min\": 12.20827\n", + "}\n", + "[2024-04-18 17:28:42,433] [INFO] [engine.py:329:run_accelerator] Saved evaluation results of input model to models/bert_gpu/gpu-tensorrt_input_model_metrics.json\n", + "[2024-04-18 17:28:42,436] [INFO] [engine.py:864:_run_pass] Running pass OnnxConversion:OnnxConversion\n", + "[2024-04-18 17:28:42,437] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c from cache/runs\n", + "[2024-04-18 17:28:42,437] [INFO] [engine.py:864:_run_pass] Running pass OrtTransformersOptimization:OrtTransformersOptimization\n", + "[2024-04-18 
17:28:42,438] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 7_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-tensorrt from cache/runs\n", + "[2024-04-18 17:28:42,438] [WARNING] [engine.py:847:_run_passes] Skipping evaluation as model was pruned\n", + "[2024-04-18 17:28:42,441] [INFO] [engine.py:864:_run_pass] Running pass OnnxConversion:OnnxConversion\n", + "[2024-04-18 17:28:42,442] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c from cache/runs\n", + "[2024-04-18 17:28:42,442] [INFO] [engine.py:864:_run_pass] Running pass OrtTransformersOptimization:OrtTransformersOptimization\n", + "[2024-04-18 17:28:42,443] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 7_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-tensorrt from cache/runs\n", + "[2024-04-18 17:28:42,443] [WARNING] [engine.py:847:_run_passes] Skipping evaluation as model was pruned\n", + "[2024-04-18 17:28:42,443] [WARNING] [footprint.py:258:_get_candidates] There is no expected candidates. Please check: 1. if the metric goal is too strict; 2. if pass config is set correctly.\n", + "[2024-04-18 17:28:42,443] [INFO] [footprint.py:101:create_pareto_frontier] Output all 3 models\n", + "[2024-04-18 17:28:42,443] [WARNING] [footprint.py:117:_create_pareto_frontier_from_nodes] There is no pareto frontier points.\n", + "[2024-04-18 17:28:42,443] [INFO] [engine.py:361:run_accelerator] Save footprint to models/bert_gpu/gpu-tensorrt_footprints.json.\n", + "[2024-04-18 17:28:42,444] [INFO] [engine.py:279:run] Run history for gpu-cuda:\n", + "[2024-04-18 17:28:42,446] [INFO] [engine.py:567:dump_run_history] run history:\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| model_id | parent_model_id | from_pass | duration_sec | metrics |\n", + "+====================================================================================+====================================================================================+=============================+================+============================================+\n", + "| 53fc6781998a4624b61959bb064622ce | | | | { |\n", + "| | | | | \"accuracy-accuracy\": 0.8602941176470589, |\n", + "| | | | | \"accuracy-f1\": 0.9042016806722689, |\n", + "| | | | | \"latency-avg\": 12.60843, |\n", + "| | | | | \"latency-max\": 12.95492, |\n", + "| | | | | \"latency-min\": 12.46612 |\n", + "| | | | | } |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c | 53fc6781998a4624b61959bb064622ce | OnnxConversion | 11.6793 | |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 1_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-cuda | 
0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c | OrtTransformersOptimization | 7.86033 | |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 2_OrtPerfTuning-1-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda | 1_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-cuda | OrtPerfTuning | 28.6088 | { |\n", + "| | | | | \"accuracy-accuracy\": 0.8602941176470589, |\n", + "| | | | | \"accuracy-f1\": 0.9042016806722689, |\n", + "| | | | | \"latency-avg\": 2.29684, |\n", + "| | | | | \"latency-max\": 2.30272, |\n", + "| | | | | \"latency-min\": 2.29272 |\n", + "| | | | | } |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 3_OrtTransformersOptimization-0-de339c0fa2131b21f568f4173abc200b-gpu-cuda | 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c | OrtTransformersOptimization | 9.70354 | |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 4_OrtPerfTuning-3-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda | 3_OrtTransformersOptimization-0-de339c0fa2131b21f568f4173abc200b-gpu-cuda | OrtPerfTuning | 25.5901 | { |\n", + "| | | | | \"accuracy-accuracy\": 0.8602941176470589, |\n", + "| | | | | \"accuracy-f1\": 0.9042016806722689, |\n", + "| | | | | \"latency-avg\": 1.33635, |\n", + "| | | | | \"latency-max\": 1.34257, |\n", + "| | | | | \"latency-min\": 1.33018 |\n", + "| | | | | } |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 5_OrtMixedPrecision-1-1e525c46c7f3e5c3f337c2ac8c50a03d | 1_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-cuda | OrtMixedPrecision | 6.87652 | |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "| 6_OrtPerfTuning-5-33c0967de0df62dc62b8f7ffc4fe4956-gpu-cuda | 5_OrtMixedPrecision-1-1e525c46c7f3e5c3f337c2ac8c50a03d | OrtPerfTuning | 31.2253 | { |\n", + "| | | | | \"accuracy-accuracy\": 0.8602941176470589, |\n", + "| | | | | \"accuracy-f1\": 0.9042016806722689, |\n", + "| | | | | \"latency-avg\": 1.33801, |\n", + "| | | | | \"latency-max\": 1.34434, |\n", + "| | | | | \"latency-min\": 1.33169 |\n", + "| | | | | } |\n", + "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n", + "[2024-04-18 17:28:42,446] [INFO] [engine.py:279:run] Run history for 
gpu-tensorrt:\n",
+      "[2024-04-18 17:28:42,447] [INFO] [engine.py:567:dump_run_history] run history:\n",
+      "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n",
+      "| model_id                                                                           | parent_model_id                                                                    | from_pass                   |   duration_sec | metrics                                    |\n",
+      "+====================================================================================+====================================================================================+=============================+================+============================================+\n",
+      "| 53fc6781998a4624b61959bb064622ce                                                   |                                                                                    |                             |                | {                                          |\n",
+      "|                                                                                    |                                                                                    |                             |                | \"accuracy-accuracy\": 0.8602941176470589,   |\n",
+      "|                                                                                    |                                                                                    |                             |                | \"accuracy-f1\": 0.9042016806722689,         |\n",
+      "|                                                                                    |                                                                                    |                             |                | \"latency-avg\": 12.52778,                   |\n",
+      "|                                                                                    |                                                                                    |                             |                | \"latency-max\": 12.80721,                   |\n",
+      "|                                                                                    |                                                                                    |                             |                | \"latency-min\": 12.20827                    |\n",
+      "|                                                                                    |                                                                                    |                             |                | }                                          |\n",
+      "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n",
+      "| 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c | 53fc6781998a4624b61959bb064622ce                                                   | OnnxConversion              |        11.6793 |                                            |\n",
+      "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n",
+      "| 7_OrtTransformersOptimization-0-ce57cc2e4971809c50a213199be68734-gpu-tensorrt      | 0_OnnxConversion-53fc6781998a4624b61959bb064622ce-bacd09aac459011bdd7b4c86b291808c | OrtTransformersOptimization |        8.23434 |                                            |\n",
+      "+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+--------------------------------------------+\n",
+      "[2024-04-18 17:28:42,448] [INFO] [engine.py:285:run] Package top ranked 0 models as artifacts\n",
+      "[2024-04-18 17:28:42,448] [WARNING] [packaging_generator.py:48:generate_output_artifacts] No model is selected. Skip packaging output artifacts.\n"
+     ]
+    }
+   ],
+   "source": [
+    "! python -m olive.workflows.run --config bert_auto_opt_gpu.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Based on the above Olive run history, we can see that:\n",
+    "1. Olive searched the candidate configs over several rounds, found the optimal optimization, and packaged the result as a ZIP archive.\n",
+    "2. During the search, invalid configs are pruned and valid configs are evaluated.\n",
+    "3. The best model is saved in the `models/bert_gpu` folder.\n",
+    "\n",
+    "Here is a comparison between the input model and the output model:\n",
+    "\n",
+    "| model type | accuracy-accuracy | accuracy-f1 | latency-avg (ms) | latency-max (ms) | latency-min (ms) |\n",
+    "| --- | --- | --- | --- | --- | --- |\n",
+    "| PyTorch (input) | 0.8603 | 0.9042 | 12.5278 | 12.8072 | 12.2083 |\n",
+    "| Olive optimized (output) | 0.8603 | 0.9042 | 1.3364 | 1.3426 | 1.3302 |\n"
+   ]
+  },
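+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To consume the winning model, we can create an ONNX Runtime inference session with the EP that won the search (`CUDAExecutionProvider` here). The following is a minimal sketch; the model path is hypothetical, since the exact file name under `models/bert_gpu` (and inside the packaged `bert` zip) depends on the Olive version:\n",
+    "\n",
+    "```python\n",
+    "import onnxruntime as ort\n",
+    "\n",
+    "# Hypothetical path: point this at the .onnx file produced under models/bert_gpu\n",
+    "session = ort.InferenceSession(\n",
+    "    \"models/bert_gpu/output_model/model.onnx\",\n",
+    "    providers=[\"CUDAExecutionProvider\"],\n",
+    ")\n",
+    "print([i.name for i in session.get_inputs()])\n",
+    "```"
+   ]
+  }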
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}