🐛 [Bug] [Dynamic Shapes] Encountered bug when using Torch-TensorRT #3140
Comments
@narendasan can you help me solve this problem? I want to make both the batch size and seq_len dynamic. |
@narendasan when will torch_executed_modules be supported in dynamo mode? |
Hi @yjjinjie you can set the dynamic shapes and pass in the dynamic inputs using
where
where the first two, (1, 8, 16) and (1, 2, 3), denote the batch_size and seq_len respectively. Can you try this and see if you get the same error as above? |
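Editor's note: the snippet from the comment above did not survive extraction. As a minimal sketch, assuming a 2-D input laid out as (batch_size, seq_len) and assuming each triple is (min, opt, max), the per-dimension triples can be regrouped into the three shape tuples that `torch_tensorrt.Input` takes as `min_shape`, `opt_shape` and `max_shape`. The helper `shape_profile` is hypothetical and not part of torch_tensorrt.

```python
# Hypothetical helper (not part of torch_tensorrt): regroup per-dimension
# (min, opt, max) triples into whole-tensor min/opt/max shape tuples.
def shape_profile(*dim_ranges):
    for lo, opt, hi in dim_ranges:
        if not (lo <= opt <= hi):
            raise ValueError(f"need min <= opt <= max, got {(lo, opt, hi)}")
    mins, opts, maxs = zip(*dim_ranges)
    return {"min_shape": mins, "opt_shape": opts, "max_shape": maxs}


# batch_size ranges over (1, 8, 16), seq_len over (1, 2, 3).
profile = shape_profile((1, 8, 16), (1, 2, 3))
print(profile)

# With torch and torch_tensorrt installed, the dict could be passed on as
# keyword arguments (the dtype here is an assumption):
#   torch_tensorrt.Input(dtype=torch.float32, **profile)
```

The input layout is a guess; a real model may have more dimensions, in which case static dimensions simply repeat the same value three times.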
Yes, I have tried torch_tensorrt.Input, but it ran into a new bug.
the error is:
|
I also tried dynamic_shapes: https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html
It has the same problem as torch._dynamo.mark_dynamic(a, 0, min=1, max=8196). |
@apbose can you help me? |
Yeah sure, let me take a look and get back on this. |
Hi @yjjinjie, may I know where I can find tzrec? It shows module not found: tzrec. |
@apbose you can just delete the tzrec and mlp code, like this:
|
I do not get the above error when I run the above code. Are you running on the latest branch? I did make a few modifications to the code, though:
|
@apbose I use torch_tensorrt 2.4.0, and with your code I get the same error. Which torch_tensorrt version are you using? |
my env is:
|
@apbose I used pip install --pre torch-tensorrt --index-url https://download.pytorch.org/whl/nightly/cu124 to install torch_tensorrt 2.5.0.dev20240822+cu124, and with that version your code runs correctly. When will 2.5.0 be released? I cannot run pip install https://download.pytorch.org/whl/nightly/cu124/torch-2.6.0.dev20241013%2Bcu124-cp311-cp311-linux_x86_64.whl because of this error:
|
@apbose in my real code there is another error when I use torch_tensorrt 2.5.0.dev20240822+cu124. When I use torch_tensorrt 2.4.0 with dynamic shapes, the error is:
The code is:
Can you help me solve this problem, @apbose? |
When I use nvcr.io/nvidia/pytorch:24.09-py3, the code works.
Which date's torch_tensorrt build does 2.5.0a0 correspond to? The Docker image's system is incompatible with my project, so when will the new 2.5.0 version be released? |
Hi @yjjinjie you can find the release wheels here- https://download.pytorch.org/whl/test/torch-tensorrt/. The torchTRT 2.5 release artifacts got pushed in officially yesterday. |
@apbose hello, when I install torch_tensorrt==2.5.0, it also has an error.
When I use nvcr.io/nvidia/pytorch:24.09-py3, the code works. It ships torch 2.5.0a0+b465a5843b.nv24.9. Which date's torch_tensorrt build does 2.5.0a0 correspond to? Can you update the 2.5.0 release? I want to install torch_tensorrt in my project. |
Can you try with a new virtual env and install torch tensorrt from here- https://download.pytorch.org/whl/test/torch-tensorrt/ the wheel torch_tensorrt-2.5.0+cu124-cp310-cp310-linux_x86_64.whl. This will have torch-tensorrt 2.5 and torch 2.5. And let me know what the error is? |
@apbose I created a new virtual env and installed torch_tensorrt-2.5.0+cu124-cp310-cp310-linux_x86_64.whl. It has the same error. I only ran:
and then ran collect_env:
The result:
The code is:
The error:
|
@apbose can you help me solve this problem? |
Yes taking a look.
|
I did not get a chance to look at this one yet, but let me get back to you soon regarding this |
I could repro the error-
on torchTRT2.4. I am yet to try on torchTRT2.5 and torchTRT2.6. Will try that and update here.
|
Yes. In torchTRT 2.4 the error is: ValueError: len() should return >= 0. In the torchTRT 2.5 release the error is: NameError: name 's0' is not defined |
Hmm, so the thing is that in the torchTRT 2.5 docker container I see it passing. It is failing in 2.4 with the error
void genericReformat::copyPackedKernel<float, float,... 0.00% 0.000us 0.00% 0.000us 0.000us 3.680us 36.62% 3.680us 1.840us 2 Self CPU time total: 2.528ms load: tensor(0.4938, device='cuda:0') |
@apbose hello, I used the image from docker pull ghcr.io/pytorch/tensorrt/torch_tensorrt:release_2.5 and it has the same error. Please use the code below; your code may not be the same as mine, because the output of my new code is multi-dimensional.
The error:
When I use nvcr.io/nvidia/pytorch:24.09-py3, the code runs correctly, and the output is
Both the torch-trt 2.5 wheel and the image have the error. Please give me a release whl that I can install in my project. |
@apbose Could you please help speed up locating the problem? Our project has been delayed for a long time in adopting this TRT feature. Thanks! |
@apbose can you help me solve this problem? Your code may be the original code, not the newer code. |
Are you using your own docker image and using torchtrt docker image as the
base image?
|
I just use
|
@apbose can you look at the issue? I think you are using the original code, not my newer code. |
@peri044 please use the code below and run it in the ghcr.io/pytorch/tensorrt/torch_tensorrt:release_2.5 image; it produces the error.
|
ok trying now, could repro the error with the additional layers. I was trying the old code before which was missing the mlp layers. The error seems to come from those. |
Tried a couple of experiments
This gives me -
which means the torch export would want the seq dimension to be equal. The below
goes past the above but again results in
Looking into this further. |
@apbose yes. I also tried dynamic_shapes, and it has the same error: NameError: name 's0' is not defined. To get past the first error (The values of seq_len_b_zero = L['args1'].size()[0] and seq_len_a_zero = L['args0'].size()[0] must always be equal), use the same dim seq_len_a_zero: dynamic_shapes=({0:seq_len_a_zero}, {0:seq_len_a_zero, 1: seq_len_b_one}, {0:seq_len_a_zero}). I think you can compare trt 2.5 with nvcr.io/nvidia/pytorch:24.09-py3, because the TRT in nvcr.io/nvidia/pytorch:24.09-py3 works, but it has no released whl. |
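Editor's note: the constraint in the first error can be pictured with a toy check. This is only a sketch of the idea that inputs sharing one dimension symbol must agree on its size; it does not call torch.export, and the helper `check_shared_dims` and the concrete sizes are made up for illustration. The spec mirrors the dynamic_shapes tuple quoted above, with plain strings standing in for the dim objects.

```python
# Toy model of the "must always be equal" rule: every axis labelled with the
# same symbol has to have the same size across all inputs.
def check_shared_dims(shapes, specs):
    seen = {}
    for index, (shape, spec) in enumerate(zip(shapes, specs)):
        for axis, symbol in spec.items():
            size = shape[axis]
            if symbol in seen and seen[symbol] != size:
                raise ValueError(
                    f"input {index}, axis {axis}: {symbol} is {size}, "
                    f"but was {seen[symbol]} earlier"
                )
            seen.setdefault(symbol, size)
    return seen


# Same layout as the dynamic_shapes tuple in the comment above.
specs = (
    {0: "seq_len_a_zero"},
    {0: "seq_len_a_zero", 1: "seq_len_b_one"},
    {0: "seq_len_a_zero"},
)

# Illustrative sizes: all three inputs agree on axis 0.
resolved = check_shared_dims([(4,), (4, 7), (4,)], specs)
print(resolved)
```

Declaring two separate symbols for axis 0 of the first two inputs would tell the exporter they may differ, which is what the quoted error rejects for this model.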
Aah ok, thanks for pointing it out @yjjinjie . So you mean the above example passes for nvcr.io/nvidia/pytorch:24.09-py3? |
Ok interesting looks like it is passing there |
The issue is coming from the lowering pass |
@apbose yes, when will this issue be fixed? |
Working on the fix; will raise a PR by next Monday. |
@apbose thanks. When the PR is merged, can you give me a release whl that is compatible with torch 2.5.0? |
Raised #3289. Yeah ok, I can help you with that. Locally can cherry pick this PR on top of 2.5 to create the compatible wheel. |
@apbose yes. I just modified the code manually, and it's OK. |
@apbose in my real project, I use the TRT scripted model to predict, and the accuracy is occasionally abnormal. |
@apbose this is my project; I just pulled 2 PRs to add torch-tensorrt. What could be the possible reasons? I use torch_tensorrt.runtime.set_multi_device_safe_mode(True). It reduces the frequency of errors, but they still occur occasionally. https://github.com/alibaba/TorchEasyRec/pull/30/files But when I predict in test_multi_tower_with_fg_train_eval_export_trt, the accuracy is occasionally abnormal... |
Could you please provide me a bit more context on what are the two PRs which you have pulled? Also could you give simple repro code of what the |
@apbose the program is very large, so I need some time to provide code. I just set the dynamic shape to
but TRT gets a length of 50, and the error is random. What can lead to randomness within TRT?
|
@apbose I ran some experiments on 4 machines and found: echo 'options nvidia NVreg_EnableGpuFirmware=1' > /etc/modprobe.d/nvidia-gsp.conf. When I enable GSP, the accuracy is OK. -------GSP Firmware Version : 535.161.08. Why is torch-tensorrt related to GSP? |
@apbose after I solved the GSP problem, the accuracy is still randomly incorrect, with about 1% probability. What could cause random accuracy? Is it multi-stream, and can I disable it? |
Hi @yjjinjie, some questions
|
@apbose my code is just like this; the dynamic shape error only appears once
then when I grep the nvidia-smi, the gsp |
Bug Description
When I use dynamic shapes in TRT, it raises an error.
The static shape works; just delete these
To Reproduce
Steps to reproduce the behavior:
the env: