🐛 Bug

When serving a model with tracing enabled, the softmax step of the sampler takes over 65% of the total per-token time, more than the decode forward pass itself.

To Reproduce

Steps to reproduce the behavior:

1. Run any model with mlc_llm serve and pass the --enable-tracing --enable-debug arguments, for example:
mlc_llm serve /workdir/Qwen2-1.5B-Instruct-mlc/ --device cuda --model-lib /workdir/Qwen2-1.5B-Instruct-mlc/qwen2-1.5b.so --port 8090 --host 0.0.0.0 --enable-tracing --enable-debug
2. Dump the Chrome trace (the serve command above listens on port 8090):
curl -X POST http://127.0.0.1:8090/debug/dump_event_trace -H "Content-Type: application/json" -d '{"model": "dist/llama"}'
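For the dumped trace to contain decode and sampler events, at least one generation request has to be served before the dump. A minimal sketch of such a request, assuming the OpenAI-compatible /v1/chat/completions endpoint exposed by mlc_llm serve on port 8090 and assuming the model id matches the path passed to serve:

```python
# Hypothetical warm-up request so the event trace contains decode/sampler events.
# Assumptions: the server above is listening on port 8090 and registers the model
# under the path given to `mlc_llm serve`.
import json
import urllib.request

payload = {
    "model": "/workdir/Qwen2-1.5B-Instruct-mlc/",  # assumed model id
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://127.0.0.1:8090/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```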
3. Parse the tracing log (a parsing sketch follows the timing list below); the softmax operator takes over 65% of the total time:
embedding (12) time cost: 0.129 ms
apply logit bias (12) time cost: 0.004 ms
apply penalty (12) time cost: 0.005 ms
apply logit mask (12) time cost: 0.004 ms
update logits (12) time cost: 0.024 ms
softmax (12) time cost: 6.229 ms
renormalization by top p (12) time cost: 0.21 ms
sampling (12) time cost: 0.114 ms
detokenization (12) time cost: 0.052 ms
callback (12) time cost: 0.104 ms
decode (12) time cost: 2.51 ms
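The figure checks out: the eleven steps above sum to roughly 9.39 ms per token, of which softmax's 6.229 ms is about 66%, more than twice the decode forward pass (2.51 ms). A minimal summarizer for the dumped trace, assuming it follows the standard Chrome trace event format (a JSON array or a {"traceEvents": [...]} object with microsecond timestamps), might look like:

```python
# Hypothetical trace summarizer: aggregates time per event name from a Chrome
# trace dump and prints each event's share of the total. Field names ("ph",
# "dur", "ts") are those of the standard Chrome trace event format and are an
# assumption about what mlc_llm's /debug/dump_event_trace returns.
import json
import sys
from collections import defaultdict

with open(sys.argv[1]) as f:
    trace = json.load(f)
events = trace["traceEvents"] if isinstance(trace, dict) else trace

totals = defaultdict(float)  # event name -> accumulated microseconds
pending = {}                 # (pid, tid, name) -> start timestamp of an open B/E pair

for ev in events:
    name = ev.get("name", "")
    if ev.get("ph") == "X":      # complete event: duration is given directly
        totals[name] += float(ev.get("dur", 0))
    elif ev.get("ph") == "B":    # begin of a begin/end pair
        pending[(ev.get("pid"), ev.get("tid"), name)] = float(ev["ts"])
    elif ev.get("ph") == "E":    # end: close the matching begin, if any
        start = pending.pop((ev.get("pid"), ev.get("tid"), name), None)
        if start is not None:
            totals[name] += float(ev["ts"]) - start

grand_total = sum(totals.values()) or 1.0
for name, us in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:<40s} {us / 1000.0:10.3f} ms  {100.0 * us / grand_total:5.1f}%")
```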
Expected behavior
Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
How you installed MLC-LLM (conda, source): source
How you installed TVM-Unity (pip, source): source
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
Any other relevant information:
Additional context