diff --git a/README.md b/README.md
index c6cbbfd..a0d1310 100644
--- a/README.md
+++ b/README.md
@@ -10,11 +10,11 @@
 And unlike `TensorRT` or `AITemplate`, which takes dozens of minutes to compile a model, `stable-fast` only takes a few seconds to compile a model.
 `stable-fast` also supports `dynamic shape`, `LoRA` and `ControlNet` out of the box.
 
-[![](https://mermaid.ink/img/pako:eNpFUsGOmzAQ_ZWRpSgXIDYsCXCoVGl76KGX3RyqrvcwwACWwEbY7CZC_HtNqLYHj948j_zezHhhlamJFexwWJRWroDl6Doa6FjAsTETWXdcYT0cpL7dqw4nF5bkUGqnXE8g2VUNBL4QWtI0oVO6BaMJ1IAtwadyHbw-g4jSAFIR3_zxgIN1NNrAV8LL9Tc88YxL5iVCvCkLb5J9oFZ9j-DMVHWSBV7pAaPKDKPqaae-_7zSMPbo_uVeuOnN555cSVszvVwhj3gkds46LHsKG7ROsnep77ugZINXU5Yqo2srGXAIw28Qc86llrrECd5Ell8CEKngPoo085HzJIA8FxsU6TsL2EDTgKr281ykhs30NkvJCg9LtLQ1ufo6nJ15veuKFW6aKWDzWPsmnhW2Ew6saLC3X-yPWvnWv8jeYE0-XZi7j9vmWmWdf9Jbb1S78fPUe7pzbrTF6bRdR63fw1xuwztZVW9r7D7y8-kcnzOMEzpfEkyTpK5KkWdN_CSa-sJFjGxdAzai_mPMf1f08PNr_zaP37P-BVvguY0?type=png)](https://mermaid.live/edit#pako:eNpFUsGOmzAQ_ZWRpSgXIDYsCXCoVGl76KGX3RyqrvcwwACWwEbY7CZC_HtNqLYHj948j_zezHhhlamJFexwWJRWroDl6Doa6FjAsTETWXdcYT0cpL7dqw4nF5bkUGqnXE8g2VUNBL4QWtI0oVO6BaMJ1IAtwadyHbw-g4jSAFIR3_zxgIN1NNrAV8LL9Tc88YxL5iVCvCkLb5J9oFZ9j-DMVHWSBV7pAaPKDKPqaae-_7zSMPbo_uVeuOnN555cSVszvVwhj3gkds46LHsKG7ROsnep77ugZINXU5Yqo2srGXAIw28Qc86llrrECd5Ell8CEKngPoo085HzJIA8FxsU6TsL2EDTgKr281ykhs30NkvJCg9LtLQ1ufo6nJ15veuKFW6aKWDzWPsmnhW2Ew6saLC3X-yPWvnWv8jeYE0-XZi7j9vmWmWdf9Jbb1S78fPUe7pzbrTF6bRdR63fw1xuwztZVW9r7D7y8-kcnzOMEzpfEkyTpK5KkWdN_CSa-sJFjGxdAzai_mPMf1f08PNr_zaP37P-BVvguY0)
+[![](https://mermaid.ink/img/pako:eNpFUsGOmzAQ_ZWRpSgXIDYsCXCoVGl76KGXXQ5V13sYYABLYCNsdhNF_HtNqLYHj948j_zezPjOatMQK9jhcFdauQLuR9fTSMcCjq2ZybrjCuvhIPX1Vvc4u7Aih1I75QYCyUo1EvhC6EjTjE7pDowmUCN2BJ_K9fD6DCJKA0hFfPXHAw7W0WQDXwkv5W944hmXzEuEeFUW3iT7QK2GAcGZue4lC7zSA0a1GSc10E59_1nSOA3o_uVeuB3M556UpK2ZX0rIIx6JnbMOq4HCFq2T7F3q2y4o2ejVlKXa6MZKBhzC8BvEnHOppa5whjeR5ZcARCq4jyLNfOQ8CSDPxQYFf2cBG2keUTV-nnepYTO9zVKywsMKLW1Nrr4OF2deb7pmhZsXCtgyNb6JZ4XdjCMrWhzsF_ujUb71L3Iw2JBP78zdpm1znbLOP-mtt6rb-GUePN07N9nidNquo87vYam24Z2sarY19h_5-XSOzxnGCZ0vCaZJ0tSVyLM2fhJtc-EiRrauAZtQ_zHmvyt6-Pm1f5vH71n_Alb0uYg?type=png)](https://mermaid.live/edit#pako:eNpFUsGOmzAQ_ZWRpSgXIDYsCXCoVGl76KGXXQ5V13sYYABLYCNsdhNF_HtNqLYHj948j_zezPjOatMQK9jhcFdauQLuR9fTSMcCjq2ZybrjCuvhIPX1Vvc4u7Aih1I75QYCyUo1EvhC6EjTjE7pDowmUCN2BJ_K9fD6DCJKA0hFfPXHAw7W0WQDXwkv5W944hmXzEuEeFUW3iT7QK2GAcGZue4lC7zSA0a1GSc10E59_1nSOA3o_uVeuB3M556UpK2ZX0rIIx6JnbMOq4HCFq2T7F3q2y4o2ejVlKXa6MZKBhzC8BvEnHOppa5whjeR5ZcARCq4jyLNfOQ8CSDPxQYFf2cBG2keUTV-nnepYTO9zVKywsMKLW1Nrr4OF2deb7pmhZsXCtgyNb6JZ4XdjCMrWhzsF_ujUb71L3Iw2JBP78zdpm1znbLOP-mtt6rb-GUePN07N9nidNquo87vYam24Z2sarY19h_5-XSOzxnGCZ0vCaZJ0tSVyLM2fhJtc-EiRrauAZtQ_zHmvyt6-Pm1f5vH71n_Alb0uYg)
 
 | Framework | torch | torch.compile | AIT  | oneflow | TensorRT | __stable-fast__ |
 | --------- | ----- | ------------- | ---- | ------- | -------- | --------------- |
-| Time/ms   | 1897  | 1510          | 1158 | 1003    | 991      | __1015__        |
+| Time/ms   | 1897  | 1510          | 1158 | 1003    | 991      | __1010__        |
 
 __NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and `CUDA Graph enabled` while `stable-fast` is running with full dynamic shape.
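The tables in this README benchmark against `torch.compile` in `max-autotune` mode with the UNet in NHWC (channels-last) memory format. As a point of reference, such a baseline can be set up in a few lines. This is an illustrative sketch, not the actual benchmark code: the toy convolutional module below merely stands in for a real diffusion UNet.

```python
import torch

# Toy convolutional module standing in for a diffusion UNet (illustrative only).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()

# channels_last (NHWC) memory format usually helps conv-heavy models on GPU.
model = model.to(memory_format=torch.channels_last)

# "max-autotune" lets Inductor spend extra compile time searching for faster
# kernels; the first call pays the compilation cost.
compiled = torch.compile(model, mode="max-autotune")

x = torch.randn(1, 3, 64, 64).to(memory_format=torch.channels_last)
with torch.no_grad():
    y = compiled(x)
```

Note that `torch.compile` recompiles when input shapes change, which is one reason the `dynamic shape` support mentioned above matters for real workloads.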
@@ -32,12 +32,9 @@ __NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and
 - [Model Quantization](#model-quantization)
 - [Some Common Methods To Speed Up PyTorch](#some-common-methods-to-speed-up-pytorch)
 - [Performance Comparison](#performance-comparison)
-  - [RTX 4080 (512x512, batch size 1, fp16, tcmalloc enabled, in WSL2)](#rtx-4080-512x512-batch-size-1-fp16-tcmalloc-enabled-in-wsl2)
-  - [RTX 4090 (512x512, batch size 1, fp16, tcmalloc enabled)](#rtx-4090-512x512-batch-size-1-fp16-tcmalloc-enabled)
-  - [RTX 3080 Ti (512x512, batch size 1, fp16, tcmalloc enabled)](#rtx-3080-ti-512x512-batch-size-1-fp16-tcmalloc-enabled)
-  - [RTX 3090 (512x512, batch size 1, fp16, tcmalloc enabled)](#rtx-3090-512x512-batch-size-1-fp16-tcmalloc-enabled)
+  - [RTX 4080 (512x512, batch size 1, fp16, in WSL2)](#rtx-4080-512x512-batch-size-1-fp16-in-wsl2)
   - [H100](#h100)
-  - [A100 PCIe 40GB](#a100-pcie-40gb)
+  - [A100](#a100)
 - [Compatibility](#compatibility)
 - [Troubleshooting](#troubleshooting)
 
@@ -269,81 +266,47 @@ Performance varies very greatly across different hardware/software/platform/driver
 It is very hard to benchmark accurately. And preparing the environment for benchmarking is also a hard job.
 I have tested on some platforms before but the results may still be inaccurate.
 Note that when benchmarking, the progress bar showed by `tqdm` may be inaccurate because of the asynchronous nature of CUDA.
 
-To solve this problem, I have to add `torch.cuda.synchronize()` after every inference step, which will slow down the inference,
-so the results might not be very accurate and might be slower than the actual performance.
+To solve this problem, I use `CUDA Event`s to measure iterations per second accurately.
 
 `stable-fast` is expected to work better on newer GPUs and newer CUDA versions.
 __On older GPUs, the performance increase might be limited.__
 __During benchmarking, the progress bar might work incorrectly because of the asynchronous nature of CUDA.__
 
-### RTX 4080 (512x512, batch size 1, fp16, tcmalloc enabled, in WSL2)
+### RTX 4080 (512x512, batch size 1, fp16, in WSL2)
 
 This is my personal gaming PC😄. It has a more powerful CPU than those from cloud server providers.
 
-| Framework                                | SD 1.5        | SD 2.1        | SD XL (1024x1024) |
-| ---------------------------------------- | ------------- | ------------- | ----------------- |
-| Vanilla PyTorch (2.1.0+cu118)            | 29.5 it/s     | 32.4 it/s     | 4.6 it/s          |
-| torch.compile (2.1.0+cu118, NHWC UNet)   | 40.0 it/s     | 44.0 it/s     | 6.1 it/s          |
-| AITemplate                               | 44.2 it/s     | untested      | untested          |
-| OneFlow                                  | 53.6 it/s     | untested      | untested          |
-| AUTO1111 WebUI                           | 17.2 it/s     | 15.2 it/s     | 3.6 it/s          |
-| AUTO1111 WebUI (with SDPA)               | 24.5 it/s     | 26.1 it/s     | 4.3 it/s          |
-| TensorRT (AUTO1111 WebUI)                | 40.8 it/s     | untested      | untested          |
-| TensorRT Official Demo                   | 52.6 it/s     | untested      | untested          |
-| __Stable Fast (with xformers & Triton)__ | __50.5 it/s__ | __53.3 it/s__ | __8.3 it/s__      |
-
-### RTX 4090 (512x512, batch size 1, fp16, tcmalloc enabled)
-
-| Framework                                | SD 1.5        | SD 2.1         | SD 1.5 ControlNet |
-| ---------------------------------------- | ------------- | -------------- | ----------------- |
-| Vanilla PyTorch (2.1.0+cu118)            | 24.9 it/s     | 27.1 it/s      | 18.9 it/s         |
-| torch.compile (2.1.0+cu118, NHWC UNet)   | 33.5 it/s     | 38.2 it/s      | 22.7 it/s         |
-| AITemplate                               | 65.7 it/s     | 71.6 it/s      | untested          |
-| OneFlow                                  | 60.1 it/s     | 12.9 it/s (??) | untested          |
-| TensorRT                                 | untested      | untested       | untested          |
-| __Stable Fast (with xformers & Triton)__ | __61.8 it/s__ | __61.6 it/s__  | __42.3 it/s__     |
-
-(??): OneFlow seems to be not working well with SD 2.1
-
-### RTX 3080 Ti (512x512, batch size 1, fp16, tcmalloc enabled)
-
-| Framework                                | SD 1.5        | SD 2.1         | SD 1.5 ControlNet |
-| ---------------------------------------- | ------------- | -------------- | ----------------- |
-| Vanilla PyTorch (2.1.0+cu118)            | 19.3 it/s     | 20.4 it/s      | 13.8 it/s         |
-| torch.compile (2.1.0+cu118, NHWC UNet)   | 24.4 it/s     | 26.9 it/s      | 17.7 it/s         |
-| AITemplate                               | untested      | untested       | untested          |
-| OneFlow                                  | 32.8 it/s     | 8.82 it/s (??) | untested          |
-| TensorRT                                 | untested      | untested       | untested          |
-| __Stable Fast (with xformers & Triton)__ | __28.1 it/s__ | __30.2 it/s__  | __20.0 it/s__     |
-
-(??): OneFlow seems to be not working well with SD 2.1
-
-### RTX 3090 (512x512, batch size 1, fp16, tcmalloc enabled)
-
-| Framework                                | SD 1.5        |
-| ---------------------------------------- | ------------- |
-| Vanilla PyTorch (2.1.0+cu118)            | 22.5 it/s     |
-| torch.compile (2.1.0+cu118, NHWC UNet)   | 25.3 it/s     |
-| AITemplate                               | 34.6 it/s     |
-| OneFlow                                  | 38.8 it/s     |
-| TensorRT                                 | untested      |
-| __Stable Fast (with xformers & Triton)__ | __31.5 it/s__ |
+| Framework                                | SD 1.5        | SD XL (1024x1024) | SD 1.5 ControlNet |
+| ---------------------------------------- | ------------- | ----------------- | ----------------- |
+| Vanilla PyTorch (2.1.0)                  | 29.5 it/s     | 4.6 it/s          | 19.7 it/s         |
+| torch.compile (2.1.0, max-autotune)      | 40.0 it/s     | 6.1 it/s          | 20.t it/s         |
+| AITemplate                               | 44.2 it/s     |                   |                   |
+| OneFlow                                  | 53.6 it/s     |                   |                   |
+| AUTO1111 WebUI                           | 17.2 it/s     | 3.6 it/s          |                   |
+| AUTO1111 WebUI (with SDPA)               | 24.5 it/s     | 4.3 it/s          |                   |
+| TensorRT (AUTO1111 WebUI)                | 40.8 it/s     |                   |                   |
+| TensorRT Official Demo                   | 52.6 it/s     |                   |                   |
+| __stable-fast (with xformers & Triton)__ | __50.8 it/s__ | __8.5 it/s__      | __36.6 it/s__     |
 
 ### H100
 
-Thanks for __@Consceleratus__'s help, I have tested speed on H100.
+Thanks to __@Consceleratus__ and __@harishp__ for their help, I have tested speed on H100.
 
-Detailed benchmarking results will be available soon.
+| Framework                                | SD 1.5         | SD XL (1024x1024) | SD 1.5 ControlNet |
+| ---------------------------------------- | -------------- | ----------------- | ----------------- |
+| Vanilla PyTorch (2.1.0)                  | 54.5 it/s      | 14.9 it/s         | 35.8 it/s         |
+| torch.compile (2.1.0, max-autotune)      | 66.0 it/s      | 18.5 it/s         | 40.7 it/s         |
+| __stable-fast (with xformers & Triton)__ | __104.6 it/s__ | __21.6 it/s__     | __72.6 it/s__     |
 
-### A100 PCIe 40GB
+### A100
 
-Thanks for __@SuperSecureHuman__ and __@jon-chuang__'s help, benchmarking on A100 PCIe 40GB is available now.
+Thanks to __@SuperSecureHuman__ and __@jon-chuang__ for their help, benchmarking on A100 is now available.
 
-| Framework                                | SD 1.5        | SD 2.1         | SD 1.5 ControlNet | SD XL          |
-| ---------------------------------------- | ------------- | -------------- | ----------------- | -------------- |
-| Vanilla PyTorch (2.1.0+cu118)            | 23.8 it/s     | 23.8 it/s      | 15.7 it/s         | 10.0 it/s      |
-| torch.compile (2.1.0+cu118, NHWC UNet)   | 37.7 it/s     | 42.7 it/s      | 24.7 it/s         | 20.9 it/s      |
-| __Stable Fast (with xformers & Triton)__ | __58.0 it/s__ | __outdated__   | __outdated__      | __outdated__   |
+| Framework                                | SD 1.5        | SD XL (1024x1024) |
+| ---------------------------------------- | ------------- | ----------------- |
+| Vanilla PyTorch (2.1.0)                  | 35.6 it/s     | 8.7 it/s          |
+| torch.compile (2.1.0, max-autotune)      | 41.9 it/s     | 10.0 it/s         |
+| __stable-fast (with xformers & Triton)__ | __61.8 it/s__ | __11.9 it/s__     |
 
 ## Compatibility