Torchscript support (#159)
* support torchscript

* add nms

* add torchscript configs and update deploy process and dump-info

* typescript -> torchscript

* add torchscript custom extension support

* add ts custom ops again

* support mmseg unet

* [WIP] add optimizer for torchscript (#119)

* add passes

* add python api

* Torchscript optimizer python api (#121)

* add passes

* add python api

* use python api instead of executable

* Merge Master, update optimizer (#151)

* [Feature] add yolox ncnn (#29)

* add yolox ncnn

* add ncnn android performance of yolox

* add ut

* fix lint

* fix None bugs for ncnn

* test codecov

* test codecov

* add device

* fix yapf

* remove if-else for img shape

* use channelshuffle optimize

* change benchmark after channelshuffle

* fix yapf

* fix yapf

* fuse continuous reshape

* fix static shape deploy

* fix code

* drop pad

* only static shape

* fix static

* fix docstring

* Added mask overlay to output image, changed fprintf info messages to … (#55)

* Added mask overlay to output image, changed fprintf info messages to stdout

* Improved box filtering (filter area/score), make sure roi coordinates stay within bounds

* clang-format

* Support UNet in mmseg (#77)

* Repeatdataset in train has no CLASSES & PALETTE

* update result for unet

* update docstring for mmdet

* remove ppl for unet in docs

* fix ort wrap about input type (#81)

* Fix memleak (#86)

* delete []

* fix build error when enable MMDEPLOY_ACTIVE_LEVEL

* fix lint

* [Doc] Nano benchmark and tutorial (#71)

* add cls benchmark

* add nano zh-cn benchmark and en tutorial

* add device row

* add doc path to index.rst

* fix typo

* [Fix] fix missing deploy_core (#80)

* fix missing deploy_core

* mv flag to demo

* target link

* [Docs] Fix links in Chinese doc (#84)

* Fix docs in Chinese link

* Fix links

* Delete symbolic link and add links to html

* delete files

* Fix link

* [Feature] Add docker files (#67)

* add gpu and cpu dockerfile

* fix lint

* fix cpu docker and remove redundant

* use pip instead

* add build arg and readme

* fix grammar

* update readme

* add chinese doc for dockerfile and add docker build to build.md

* grammar

* refine dockerfiles

* add FAQs

* update Dpplcv_DIR for SDK building

* remove mmcls

* add sdk demos

* fix typo and lint

* update FAQs

* [Fix]fix check_env (#101)

* fix check_env

* update

* Replace convert_syncbatchnorm in mmseg (#93)

* replace convert_syncbatchnorm with revert_sync_batchnorm from mmcv

* change logger

* [Doc] Update FAQ for TensorRT (#96)

* update FAQ

* comment

* [Docs]: Update doc for openvino installation (#102)

* fix docs

* fix docs

* fix docs

* fix mmcv version

* fix docs

* rm blank line

* simplify non batch nms (#99)

* [Enhancement] Allow test.py to save evaluation results (#108)

* Add log file

* Delete debug code

* Rename logger

* resolve comments

* [Enhancement] Support mmocr v0.4+ (#115)

* support mmocr v0.4+

* 0.4.0 -> 0.4.1

* fix onnxruntime wrapper for gpu inference (#123)

* fix ncnn wrapper for ort-gpu

* resolve comment

* fix lint

* Fix typo (#132)

* lock mmcls version (#131)

* [Enhancement] upgrade isort in pre-commit config (#141)

* [Enhancement] upgrade isort in pre-commit config by referring to mmflow pr #87

* fix lint

* remove .isort.cfg and put its known_third_party to setup.cfg

* Fix ci for mmocr (#144)

* fix mmocr unittests

* remove useless

* lock mmdet maximum version to 2.20

* pip install -U numpy

* Fix capture_output (#125)

Co-authored-by: hanrui1sensetime <[email protected]>
Co-authored-by: Johannes L <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: VVsssssk <[email protected]>
Co-authored-by: lvhan028 <[email protected]>
Co-authored-by: AllentDan <[email protected]>
Co-authored-by: Yifan Zhou <[email protected]>
Co-authored-by: 杨培文 (Yang Peiwen) <[email protected]>
Co-authored-by: Semyon Bevzyuk <[email protected]>

* configs for all tasks

* use torchvision roi align

* remove unnecessary code

* fix ut

* fix ut

* export

* det dynamic

* det dynamic

* add ut

* fix ut

* add ut and docs

* fix ut

* skip torchscript ut if no ops available

* add torchscript option to build.md

* update benchmark and resolve comments

* resolve conflicts

* rename configs

* fix mrcnn cuda test

* remove useless

* add version requirements to docs and comments to codes

* enable empty image exporting for torchscript and accelerate ORT inference for MRCNN

* rebase

* update example for torchscript.md

* update FAQs for torchscript.md

* resolve comments

* only use torchvision roi_align for torchscript

* fix ut

* use torchvision roi align when pool model is avg

* resolve comments

Co-authored-by: grimoire <[email protected]>
Co-authored-by: hanrui1sensetime <[email protected]>
Co-authored-by: Johannes L <[email protected]>
Co-authored-by: RunningLeon <[email protected]>
Co-authored-by: VVsssssk <[email protected]>
Co-authored-by: lvhan028 <[email protected]>
Co-authored-by: Yifan Zhou <[email protected]>
Co-authored-by: 杨培文 (Yang Peiwen) <[email protected]>
Co-authored-by: Semyon Bevzyuk <[email protected]>
11 people committed Apr 1, 2022
1 parent ef352d1 commit db79695
Showing 62 changed files with 1,582 additions and 245 deletions.
1 change: 1 addition & 0 deletions configs/_base_/backends/torchscript.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
backend_config = dict(type='torchscript')
6 changes: 6 additions & 0 deletions configs/_base_/torchscript_config.py
@@ -0,0 +1,6 @@
ir_config = dict(
    type='torchscript',
    save_file='end2end.pt',
    input_names=['input'],
    output_names=['output'],
    input_shape=None)
6 changes: 6 additions & 0 deletions configs/mmcls/classification_torchscript.py
@@ -0,0 +1,6 @@
_base_ = [
    '../_base_/torchscript_config.py', '../_base_/backends/torchscript.py'
]

ir_config = dict(input_shape=None)
codebase_config = dict(type='mmcls', task='Classification')
4 changes: 4 additions & 0 deletions configs/mmdet/_base_/base_instance-seg_torchscript.py
@@ -0,0 +1,4 @@
_base_ = ['./base_torchscript.py']

ir_config = dict(output_names=['dets', 'labels', 'masks'])
codebase_config = dict(post_processing=dict(export_postprocess_mask=False))
16 changes: 16 additions & 0 deletions configs/mmdet/_base_/base_torchscript.py
@@ -0,0 +1,16 @@
_base_ = ['../../_base_/torchscript_config.py']

ir_config = dict(output_names=['dets', 'labels'])
codebase_config = dict(
    type='mmdet',
    task='ObjectDetection',
    model_type='end2end',
    post_processing=dict(
        score_threshold=0.05,
        confidence_threshold=0.005,  # for YOLOv3
        iou_threshold=0.5,
        max_output_boxes_per_class=200,
        pre_top_k=5000,
        keep_top_k=100,
        background_label_id=-1,
    ))
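The instance-segmentation base config above inherits from this file: it overrides only `output_names` and adds `export_postprocess_mask`. As a rough illustration of how such `_base_` chains resolve, here is a minimal pure-Python sketch of a recursive dictionary overlay. The `merge` helper is hypothetical and only approximates the config loader's behaviour (nested dicts merged key by key, lists and scalars replaced); it is not the loader's actual implementation, and the `post_processing` dict is abbreviated.

```python
import copy


def merge(base: dict, child: dict) -> dict:
    """Overlay `child` on `base` recursively (hypothetical helper)."""
    out = copy.deepcopy(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            # Nested dicts are merged key by key.
            out[key] = merge(out[key], value)
        else:
            # Lists and scalars replace the inherited value outright.
            out[key] = copy.deepcopy(value)
    return out


# Values copied from the configs added in this commit.
torchscript_config = dict(
    ir_config=dict(
        type='torchscript',
        save_file='end2end.pt',
        input_names=['input'],
        output_names=['output'],
        input_shape=None))

base_torchscript = dict(
    ir_config=dict(output_names=['dets', 'labels']),
    codebase_config=dict(
        type='mmdet',
        task='ObjectDetection',
        model_type='end2end',
        post_processing=dict(score_threshold=0.05, keep_top_k=100)))

base_instance_seg = dict(
    ir_config=dict(output_names=['dets', 'labels', 'masks']),
    codebase_config=dict(
        post_processing=dict(export_postprocess_mask=False)))

resolved = merge(merge(torchscript_config, base_torchscript), base_instance_seg)
print(resolved['ir_config']['output_names'])  # ['dets', 'labels', 'masks']
print(resolved['codebase_config']['post_processing'])
```

In this sketch the child config changes only the keys it names, so `save_file` and the detection thresholds carry over from the base files unchanged.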
3 changes: 3 additions & 0 deletions configs/mmdet/detection/detection_torchscript.py
@@ -0,0 +1,3 @@
_base_ = [
    '../_base_/base_torchscript.py', '../../_base_/backends/torchscript.py'
]
4 changes: 4 additions & 0 deletions configs/mmdet/instance-seg/instance-seg_torchscript.py
@@ -0,0 +1,4 @@
_base_ = [
    '../_base_/base_instance-seg_torchscript.py',
    '../../_base_/backends/torchscript.py'
]
@@ -0,0 +1,7 @@
_base_ = [
    '../../_base_/torchscript_config.py',
    '../../_base_/backends/torchscript.py'
]

ir_config = dict(input_shape=None)
codebase_config = dict(type='mmedit', task='SuperResolution')
7 changes: 7 additions & 0 deletions configs/mmocr/text-detection/text-detection_torchscript.py
@@ -0,0 +1,7 @@
_base_ = [
    '../../_base_/torchscript_config.py',
    '../../_base_/backends/torchscript.py'
]

ir_config = dict(input_shape=None)
codebase_config = dict(type='mmocr', task='TextDetection')
@@ -0,0 +1,7 @@
_base_ = [
    '../../_base_/torchscript_config.py',
    '../../_base_/backends/torchscript.py'
]

ir_config = dict(input_shape=None)
codebase_config = dict(type='mmocr', task='TextRecognition')
6 changes: 6 additions & 0 deletions configs/mmseg/segmentation_torchscript.py
@@ -0,0 +1,6 @@
_base_ = [
    '../_base_/torchscript_config.py', '../_base_/backends/torchscript.py'
]

ir_config = dict(input_shape=None)
codebase_config = dict(type='mmseg', task='Segmentation')
6 changes: 6 additions & 0 deletions csrc/backend_ops/CMakeLists.txt
@@ -30,3 +30,9 @@ if ("ncnn" IN_LIST MMDEPLOY_TARGET_BACKENDS)
message(STATUS "Build NCNN custom ops")
add_subdirectory(ncnn)
endif ()

# build TorchScript ops
if ("torchscript" IN_LIST MMDEPLOY_TARGET_BACKENDS)
message(STATUS "Build torchscript custom ops")
add_subdirectory(torchscript)
endif ()
@@ -0,0 +1,94 @@
// Copyright (c) OpenMMLab. All rights reserved.
#ifndef COMMON_CUDA_HELPER
#define COMMON_CUDA_HELPER

#include <cublas_v2.h>
#include <cuda.h>

#include <algorithm>

#define CUDA_1D_KERNEL_LOOP(i, n) \
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); i += blockDim.x * gridDim.x)

#define THREADS_PER_BLOCK 512

#define DIVUP(m, n) ((m) / (n) + ((m) % (n) > 0))
inline int GET_BLOCKS(const int N) {
int optimal_block_num = DIVUP(N, THREADS_PER_BLOCK);
int max_block_num = 4096;
return std::min(optimal_block_num, max_block_num);
}

#define cudaCheckError() \
{ \
cudaError_t e = cudaGetLastError(); \
if (e != cudaSuccess) { \
printf("Cuda failure %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
exit(0); \
} \
}

/**
* Returns a view of the original tensor with its dimensions permuted.
*
* @param[out] dst pointer to the destination tensor
* @param[in] src pointer to the source tensor
* @param[in] src_size shape of the src tensor
* @param[in] permute The desired ordering of dimensions
* @param[in] src_dim dim of src tensor
* @param[in] stream cuda stream handle
*/
template <class scalar_t>
void memcpyPermute(scalar_t* dst, const scalar_t* src, int* src_size, int* permute, int src_dim,
cudaStream_t stream = 0);

template <typename scalar_t>
cublasStatus_t cublasGemmWrap(cublasHandle_t handle, cublasOperation_t transa,
cublasOperation_t transb, int m, int n, int k, const scalar_t* alpha,
const scalar_t* A, int lda, const scalar_t* B, int ldb,
const scalar_t* beta, scalar_t* C, int ldc);

template <typename scalar_t>
__device__ scalar_t bilinear_interpolate(const scalar_t* input, const int height, const int width,
scalar_t y, scalar_t x) {
// deal with cases that inverse elements are out of feature map boundary
if (y < -1.0 || y > height || x < -1.0 || x > width) return 0;

if (y <= 0) y = 0;
if (x <= 0) x = 0;

int y_low = (int)y;
int x_low = (int)x;
int y_high;
int x_high;

if (y_low >= height - 1) {
y_high = y_low = height - 1;
y = (scalar_t)y_low;
} else {
y_high = y_low + 1;
}

if (x_low >= width - 1) {
x_high = x_low = width - 1;
x = (scalar_t)x_low;
} else {
x_high = x_low + 1;
}

scalar_t ly = y - y_low;
scalar_t lx = x - x_low;
scalar_t hy = 1. - ly, hx = 1. - lx;
// do bilinear interpolation
scalar_t v1 = input[y_low * width + x_low];
scalar_t v2 = input[y_low * width + x_high];
scalar_t v3 = input[y_high * width + x_low];
scalar_t v4 = input[y_high * width + x_high];
scalar_t w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;

scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);

return val;
}

#endif // COMMON_CUDA_HELPER
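To make the launch arithmetic in this header concrete, the following Python sketch mirrors `DIVUP` and `GET_BLOCKS` and estimates how many iterations of the grid-stride loop in `CUDA_1D_KERNEL_LOOP` the busiest thread performs. The function names `divup`, `get_blocks`, and `max_iterations_per_thread` are illustrative ports written for this note, not part of the commit, and the sketch assumes non-negative element counts.

```python
THREADS_PER_BLOCK = 512  # same value as the macro in the header
MAX_BLOCK_NUM = 4096     # same cap as in GET_BLOCKS


def divup(m: int, n: int) -> int:
    """Ceiling division, mirroring the DIVUP macro for non-negative m."""
    return m // n + (1 if m % n > 0 else 0)


def get_blocks(n: int) -> int:
    """Number of blocks launched for n elements, capped at MAX_BLOCK_NUM."""
    return min(divup(n, THREADS_PER_BLOCK), MAX_BLOCK_NUM)


def max_iterations_per_thread(n: int) -> int:
    """Iterations of the grid-stride loop run by the busiest thread."""
    total_threads = get_blocks(n) * THREADS_PER_BLOCK
    if total_threads == 0:
        return 0
    return divup(n, total_threads)


for n in (1000, 4096 * 512, 4096 * 512 + 1):
    print(n, get_blocks(n), max_iterations_per_thread(n))
```

Once the block count hits the cap, the grid-stride loop is what keeps the kernel correct: each thread simply processes more than one element instead of the launch requesting more blocks.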
@@ -0,0 +1,82 @@
#include <cmath>
#include <cstdint>

template <typename T>
T bilinear_interpolate_2d(const T *src, const int64_t src_h, const int64_t src_w, const T h,
const T w) {
if (h <= -1 || src_h <= h || w <= -1 || src_w <= w) {
return 0;
}

int64_t h_low = floor(h);
int64_t w_low = floor(w);
int64_t h_high = h_low + 1;
int64_t w_high = w_low + 1;

T lh = h - h_low;
T lw = w - w_low;
T hh = 1 - lh;
T hw = 1 - lw;

T v1 = 0;
if (h_low >= 0 && w_low >= 0) v1 = src[h_low * src_w + w_low];
T v2 = 0;
if (h_low >= 0 && w_high <= src_w - 1) v2 = src[h_low * src_w + w_high];
T v3 = 0;
if (h_high <= src_h - 1 && w_low >= 0) v3 = src[h_high * src_w + w_low];
T v4 = 0;
if (h_high <= src_h - 1 && w_high <= src_w - 1) v4 = src[h_high * src_w + w_high];

T w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;

T val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);
return val;
}

// output: (channels * kernel_h * kernel_w, dst_h * dst_w)
template <typename T>
void deformable_im2col_2d(const T *input, const T *offset, const T *mask, const int64_t src_h,
const int64_t src_w, const int64_t kernel_h, const int64_t kernel_w,
const int64_t pad_h, const int64_t pad_w, const int64_t stride_h,
const int64_t stride_w, const int64_t dilation_h,
const int64_t dilation_w, const int64_t channels,
const int64_t offset_groups, const int64_t dst_h, const int64_t dst_w,
const bool use_mask, T *columns) {
const int64_t workload = channels * dst_h * dst_w;
for (int64_t index = 0; index != workload; ++index) {
const int64_t ow = index % dst_w;
const int64_t oh = (index / dst_w) % dst_h;
const int64_t ic = index / (dst_w * dst_h);
const int64_t oc = ic * kernel_h * kernel_w;

int64_t c_per_offset_grp = channels / offset_groups;
const int64_t grp_idx = ic / c_per_offset_grp;

auto columns_ptr = columns + (oc * (dst_h * dst_w) + oh * dst_w + ow);
auto input_ptr = input + ic * (src_h * src_w);
auto offset_ptr = offset + grp_idx * 2 * kernel_h * kernel_w * dst_h * dst_w;
auto mask_ptr = mask;
if (use_mask) {
mask_ptr += grp_idx * kernel_h * kernel_w * dst_h * dst_w;
}

for (int64_t kh = 0; kh < kernel_h; ++kh) {
for (int64_t kw = 0; kw < kernel_w; ++kw) {
const int64_t mask_idx = kh * kernel_w + kw;
const int64_t offset_idx = 2 * mask_idx;

T mask_value = 1;
if (use_mask) {
mask_value = mask_ptr[mask_idx * (dst_h * dst_w) + oh * dst_w + ow];
}

const T offset_h = offset_ptr[offset_idx * (dst_h * dst_w) + oh * dst_w + ow];
const T offset_w = offset_ptr[(offset_idx + 1) * (dst_h * dst_w) + oh * dst_w + ow];
const T ih = (oh * stride_h - pad_h) + kh * dilation_h + offset_h;
const T iw = (ow * stride_w - pad_w) + kw * dilation_w + offset_w;
*columns_ptr = mask_value * bilinear_interpolate_2d<T>(input_ptr, src_h, src_w, ih, iw);
columns_ptr += dst_h * dst_w;
}
}
}
}
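For readers who want to check the sampling behaviour by hand, here is a small Python port of the templated `bilinear_interpolate_2d` above. It is a sketch for illustration only (the Python function and the 2x2 test image are not part of the commit). Neighbours that fall outside the image contribute zero, whereas the CUDA `bilinear_interpolate` in the shared helper header clamps coordinates to the border instead.

```python
import math


def bilinear_interpolate_2d(src, src_h, src_w, h, w):
    """Sample a row-major src_h x src_w image at fractional (h, w)."""
    # Points more than one pixel outside the image sample as zero.
    if h <= -1 or src_h <= h or w <= -1 or src_w <= w:
        return 0.0

    h_low, w_low = math.floor(h), math.floor(w)
    h_high, w_high = h_low + 1, w_low + 1

    lh, lw = h - h_low, w - w_low  # distance to the low neighbours
    hh, hw = 1 - lh, 1 - lw        # distance to the high neighbours

    # Out-of-range neighbours are treated as zero padding.
    v1 = src[h_low * src_w + w_low] if h_low >= 0 and w_low >= 0 else 0.0
    v2 = src[h_low * src_w + w_high] if h_low >= 0 and w_high <= src_w - 1 else 0.0
    v3 = src[h_high * src_w + w_low] if h_high <= src_h - 1 and w_low >= 0 else 0.0
    v4 = (src[h_high * src_w + w_high]
          if h_high <= src_h - 1 and w_high <= src_w - 1 else 0.0)

    return hh * hw * v1 + hh * lw * v2 + lh * hw * v3 + lh * lw * v4


image = [1.0, 2.0,
         3.0, 4.0]  # 2x2 image stored row-major

print(bilinear_interpolate_2d(image, 2, 2, 0.5, 0.5))  # 2.5, centre of the image
print(bilinear_interpolate_2d(image, 2, 2, 1.5, 0.5))  # 1.75, bottom neighbours are padding
print(bilinear_interpolate_2d(image, 2, 2, 2.0, 0.0))  # 0.0, outside the image
```

The second call shows the zero-padding effect: only the bottom row of the image contributes, each pixel with weight 0.25.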
@@ -68,7 +68,7 @@

#include <float.h>

-#include "common_cuda_helper.hpp"
+#include "common_cuda_helper.cuh"

template <typename T>
__device__ T dmcn_im2col_bilinear(const T *input, const int data_width, const int height,
1 change: 1 addition & 0 deletions csrc/backend_ops/onnxruntime/CMakeLists.txt
@@ -14,6 +14,7 @@ mmdeploy_export(${PROJECT_NAME}_obj)
target_include_directories(${PROJECT_NAME}_obj PUBLIC
$<BUILD_INTERFACE:${ONNXRUNTIME_DIR}/include>
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/common>
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/../common>
$<BUILD_INTERFACE:${CMAKE_SOURCE_DIR}/csrc>)
target_link_directories(${PROJECT_NAME}_obj PUBLIC
${ONNXRUNTIME_DIR}/lib)
@@ -4,88 +4,11 @@
#include <cmath>
#include <vector>

+#include "modulated_deform_conv/modulated_deform_conv_cpu.h"
#include "ort_utils.h"

namespace mmdeploy {

-float bilinear_interpolate_2d(const float *src, const int64_t src_h, const int64_t src_w,
-const float h, const float w) {
-if (h <= -1 || src_h <= h || w <= -1 || src_w <= w) {
-return 0;
-}
-
-int64_t h_low = floor(h);
-int64_t w_low = floor(w);
-int64_t h_high = h_low + 1;
-int64_t w_high = w_low + 1;
-
-float lh = h - h_low;
-float lw = w - w_low;
-float hh = 1 - lh;
-float hw = 1 - lw;
-
-float v1 = 0;
-if (h_low >= 0 && w_low >= 0) v1 = src[h_low * src_w + w_low];
-float v2 = 0;
-if (h_low >= 0 && w_high <= src_w - 1) v2 = src[h_low * src_w + w_high];
-float v3 = 0;
-if (h_high <= src_h - 1 && w_low >= 0) v3 = src[h_high * src_w + w_low];
-float v4 = 0;
-if (h_high <= src_h - 1 && w_high <= src_w - 1) v4 = src[h_high * src_w + w_high];
-
-float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;
-
-float val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);
-return val;
-}
-
-// output: (channels * kernel_h * kernel_w, dst_h * dst_w)
-void deformable_im2col_2d(const float *input, const float *offset, const float *mask,
-const int64_t src_h, const int64_t src_w, const int64_t kernel_h,
-const int64_t kernel_w, const int64_t pad_h, const int64_t pad_w,
-const int64_t stride_h, const int64_t stride_w, const int64_t dilation_h,
-const int64_t dilation_w, const int64_t channels,
-const int64_t offset_groups, const int64_t dst_h, const int64_t dst_w,
-const bool use_mask, float *columns) {
-const int64_t workload = channels * dst_h * dst_w;
-for (int64_t index = 0; index != workload; ++index) {
-const int64_t ow = index % dst_w;
-const int64_t oh = (index / dst_w) % dst_h;
-const int64_t ic = index / (dst_w * dst_h);
-const int64_t oc = ic * kernel_h * kernel_w;
-
-int64_t c_per_offset_grp = channels / offset_groups;
-const int64_t grp_idx = ic / c_per_offset_grp;
-
-auto columns_ptr = columns + (oc * (dst_h * dst_w) + oh * dst_w + ow);
-auto input_ptr = input + ic * (src_h * src_w);
-auto offset_ptr = offset + grp_idx * 2 * kernel_h * kernel_w * dst_h * dst_w;
-auto mask_ptr = mask;
-if (use_mask) {
-mask_ptr += grp_idx * kernel_h * kernel_w * dst_h * dst_w;
-}
-
-for (int64_t kh = 0; kh < kernel_h; ++kh) {
-for (int64_t kw = 0; kw < kernel_w; ++kw) {
-const int64_t mask_idx = kh * kernel_w + kw;
-const int64_t offset_idx = 2 * mask_idx;
-
-float mask_value = 1;
-if (use_mask) {
-mask_value = mask_ptr[mask_idx * (dst_h * dst_w) + oh * dst_w + ow];
-}
-
-const float offset_h = offset_ptr[offset_idx * (dst_h * dst_w) + oh * dst_w + ow];
-const float offset_w = offset_ptr[(offset_idx + 1) * (dst_h * dst_w) + oh * dst_w + ow];
-const float ih = (oh * stride_h - pad_h) + kh * dilation_h + offset_h;
-const float iw = (ow * stride_w - pad_w) + kw * dilation_w + offset_w;
-*columns_ptr = mask_value * bilinear_interpolate_2d(input_ptr, src_h, src_w, ih, iw);
-columns_ptr += dst_h * dst_w;
-}
-}
-}
-}

void gemm_ref_fp32(const float *A, const float *B, const float *V, const float *H,
const int32_t trans_A, const int32_t trans_B, const int32_t M, const int32_t N,
const int32_t K, const float alpha, const float beta, float *Y) {
@@ -162,12 +85,12 @@ void deformable_conv2d_ref_fp32(const float *src, const float *offset, const flo

for (int64_t b = 0; b < batch; ++b) {
for (int64_t g = 0; g < group; ++g) {
-deformable_im2col_2d(src + b * src_c * src_h * src_w + g * ic_per_gp * src_h * src_w,
-offset + b * offset_group * 2 * kernel_h * kernel_w * dst_h * dst_w,
-mask + b * offset_group * kernel_h * kernel_w * dst_h * dst_w, src_h,
-src_w, kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w, dilation_h,
-dilation_w, ic_per_gp, offset_group, dst_h, dst_w, mask != nullptr,
-columns);
+deformable_im2col_2d<float>(
+src + b * src_c * src_h * src_w + g * ic_per_gp * src_h * src_w,
+offset + b * offset_group * 2 * kernel_h * kernel_w * dst_h * dst_w,
+mask + b * offset_group * kernel_h * kernel_w * dst_h * dst_w, src_h, src_w, kernel_h,
+kernel_w, pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w, ic_per_gp,
+offset_group, dst_h, dst_w, mask != nullptr, columns);
float *dst_ptr = dst + b * dst_c * dst_h * dst_w + g * oc_per_gp * dst_h * dst_w;
if (bias != nullptr) {
const float *bias_ptr = bias + g * oc_per_gp;