
Commit

Update gh-pages for pic hyperlinks. (#1973)
Co-authored-by: Guoming Zhang <[email protected]>
nv-guomingz and Guoming Zhang authored Jul 17, 2024
1 parent df5423f commit 10588d0
Showing 52 changed files with 5,905 additions and 5,905 deletions.
2 changes: 1 addition & 1 deletion _cpp_gen/executor.html
@@ -4724,7 +4724,7 @@ <h2>types.h<a class="headerlink" href="#types-h" title="Link to this heading">
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf1b5ad70>
<jinja2.runtime.BlockReference object at 0x7f8a046e6800>

<div class="footer">
<p>
11,586 changes: 5,793 additions & 5,793 deletions _cpp_gen/runtime.html

Large diffs are not rendered by default.

120 changes: 60 additions & 60 deletions _sources/_cpp_gen/runtime.rst.txt
@@ -28,66 +28,6 @@ ____________
.. doxygenfile:: cudaStream.h
:project: TensorRT-LLM

generationInput.h
_________________

.. doxygenfile:: generationInput.h
:project: TensorRT-LLM

generationOutput.h
__________________

.. doxygenfile:: generationOutput.h
:project: TensorRT-LLM

ipcUtils.h
__________

.. doxygenfile:: ipcUtils.h
:project: TensorRT-LLM

loraCache.h
___________

.. doxygenfile:: loraCache.h
:project: TensorRT-LLM

loraCachePageManagerConfig.h
____________________________

.. doxygenfile:: loraCachePageManagerConfig.h
:project: TensorRT-LLM

loraModule.h
____________

.. doxygenfile:: loraModule.h
:project: TensorRT-LLM

memoryCounters.h
________________

.. doxygenfile:: memoryCounters.h
:project: TensorRT-LLM

promptTuningParams.h
____________________

.. doxygenfile:: promptTuningParams.h
:project: TensorRT-LLM

tllmLogger.h
____________

.. doxygenfile:: tllmLogger.h
:project: TensorRT-LLM

worldConfig.h
_____________

.. doxygenfile:: worldConfig.h
:project: TensorRT-LLM

decodingInput.h
_______________

@@ -106,6 +46,18 @@ ____________________________
.. doxygenfile:: explicitDraftTokensBuffers.h
:project: TensorRT-LLM

generationInput.h
_________________

.. doxygenfile:: generationInput.h
:project: TensorRT-LLM

generationOutput.h
__________________

.. doxygenfile:: generationOutput.h
:project: TensorRT-LLM

gptDecoder.h
____________

@@ -154,24 +106,60 @@ _________
.. doxygenfile:: iTensor.h
:project: TensorRT-LLM

ipcUtils.h
__________

.. doxygenfile:: ipcUtils.h
:project: TensorRT-LLM

lookaheadModule.h
_________________

.. doxygenfile:: lookaheadModule.h
:project: TensorRT-LLM

loraCache.h
___________

.. doxygenfile:: loraCache.h
:project: TensorRT-LLM

loraCachePageManagerConfig.h
____________________________

.. doxygenfile:: loraCachePageManagerConfig.h
:project: TensorRT-LLM

loraModule.h
____________

.. doxygenfile:: loraModule.h
:project: TensorRT-LLM

medusaModule.h
______________

.. doxygenfile:: medusaModule.h
:project: TensorRT-LLM

memoryCounters.h
________________

.. doxygenfile:: memoryCounters.h
:project: TensorRT-LLM

modelConfig.h
_____________

.. doxygenfile:: modelConfig.h
:project: TensorRT-LLM

promptTuningParams.h
____________________

.. doxygenfile:: promptTuningParams.h
:project: TensorRT-LLM

rawEngine.h
___________

@@ -202,3 +190,15 @@ ___________________________
.. doxygenfile:: speculativeDecodingModule.h
:project: TensorRT-LLM

tllmLogger.h
____________

.. doxygenfile:: tllmLogger.h
:project: TensorRT-LLM

worldConfig.h
_____________

.. doxygenfile:: worldConfig.h
:project: TensorRT-LLM
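The hunks in this file show each header's section moved into plain alphabetical order — note `iTensor.h` sorting before `ipcUtils.h`, i.e. case-sensitive ASCII order, as `sorted()` gives by default. A listing in that order can be generated mechanically; this is a hedged sketch reconstructed from the diff, not the project's actual doc tooling, and the header list is abbreviated:

```python
# Emit doxygenfile sections in the same case-sensitive ASCII order as the
# reordered runtime.rst (abbreviated header list for illustration).
headers = [
    "worldConfig.h", "ipcUtils.h", "iTensor.h",
    "loraCache.h", "generationInput.h", "tllmLogger.h",
]

sections = []
for name in sorted(headers):  # plain sorted(): 'T' < 'p', so iTensor.h precedes ipcUtils.h
    underline = "_" * len(name)  # RST underline must cover the full title
    sections.append(
        f"{name}\n{underline}\n\n.. doxygenfile:: {name}\n   :project: TensorRT-LLM\n"
    )

print("\n".join(sections))
```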

2 changes: 1 addition & 1 deletion _sources/blogs/XQA-kernel.md.txt
@@ -8,7 +8,7 @@ Support matrix and usage flags are described in [docs/source/advanced/gpt_attent
Looking at the Throughput-Latency curves below, we see that the enabling of XQA optimization increases throughput. Higher throughput equates to serving more users, and we can see that TPOT on the Y-axis flattens out when XQA gets enabled.


<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/media/XQA_ThroughputvsLatency.png" alt="XQA increased throughput within same latency budget" width="950" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/XQA_ThroughputvsLatency.png?raw=true" alt="XQA increased throughput within same latency budget" width="950" height="auto">

<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a</sub>

2 changes: 1 addition & 1 deletion _sources/performance/perf-best-practices.md.txt
@@ -26,7 +26,7 @@ runtime and, for some of them, decrease the engine build time.
### `max_batch_size`, `max_seq_len` and `max_num_tokens`

<p align="center">
<img src="../media/max_bs_toks_len.svg" alt="Explain `max_batch_size`, `max_seq_len` and `max_num_tokens`" width="30%" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/media/max_bs_toks_len.svg?raw=true" alt="Explain `max_batch_size`, `max_seq_len` and `max_num_tokens`" width="30%" height="auto">
</p>

Regarding the impacts of those three arguments to the GPU memory usage, please refer to [memory.md](../reference/memory.md)
2 changes: 1 addition & 1 deletion _sources/speculative_decoding.md.txt
@@ -253,7 +253,7 @@ Consider the following diagram, which illustrates how the hidden states from the
are passed to the base model's language model (LM) head and to four Medusa heads (MHs).

<p align="center">
<img src="./media/medusa_tree.svg" alt="Example Medusa Tree" width="auto" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/media/medusa_tree.svg?raw=true" alt="Example Medusa Tree" width="auto" height="auto">
</p>

In this example:
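Every `.md.txt` hunk above makes the same mechanical change: a relative `<img src>` (or a `main`-branch link) is rewritten to an absolute `blob/rel/...?raw=true` URL on the repository. A rewrite of this kind can be sketched as below; the `rewrite_img_src` helper is illustrative and not part of this commit, and the URL prefix is copied from the diff:

```python
import re

# URL prefix observed in this commit's hunks; repo and branch are an
# assumption if this sketch is reused elsewhere.
RAW_PREFIX = "https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source"

IMG_SRC = re.compile(r'(<img\s+[^>]*src=")([^"]+)(")')

def rewrite_img_src(html: str) -> str:
    """Point relative <img> sources at the GitHub blob URL with ?raw=true."""
    def repl(m: re.Match) -> str:
        src = m.group(2)
        if src.startswith(("http://", "https://")):
            return m.group(0)  # already absolute: leave untouched
        # Drop leading "./" / "../" segments; media paths live under docs/source.
        while src.startswith(("./", "../")):
            src = src.split("/", 1)[1]
        return f"{m.group(1)}{RAW_PREFIX}/{src}?raw=true{m.group(3)}"
    return IMG_SRC.sub(repl, html)
```

Applied to the tag from perf-best-practices.md, this reproduces the new line in the hunk above.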
2 changes: 1 addition & 1 deletion advanced/batch-manager.html
@@ -411,7 +411,7 @@ <h2>In-flight Batching with the Triton Inference Server<a class="headerlink" hre
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf305c430>
<jinja2.runtime.BlockReference object at 0x7f8a09129930>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion advanced/expert-parallelism.html
@@ -169,7 +169,7 @@ <h2>How to Enable<a class="headerlink" href="#how-to-enable" title="Link to this
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3045b40>
<jinja2.runtime.BlockReference object at 0x7f8a0912bf10>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion advanced/gpt-attention.html
@@ -486,7 +486,7 @@ <h3>Relative Attention Bias (RAB)<a class="headerlink" href="#relative-attention
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf2fca320>
<jinja2.runtime.BlockReference object at 0x7f8a091076d0>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion advanced/gpt-runtime.html
@@ -378,7 +378,7 @@ <h2>Know Issues and Future Changes<a class="headerlink" href="#know-issues-and-f
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3045180>
<jinja2.runtime.BlockReference object at 0x7f8a090f4f10>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion advanced/graph-rewriting.html
@@ -349,7 +349,7 @@ <h2>Classical Workflow<a class="headerlink" href="#classical-workflow" title="Li
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3d06110>
<jinja2.runtime.BlockReference object at 0x7f8a092e8430>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion advanced/inference-request.html
@@ -365,7 +365,7 @@ <h1>Responses<a class="headerlink" href="#responses" title="Link to this heading
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3be3220>
<jinja2.runtime.BlockReference object at 0x7f8a09121810>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion advanced/lora.html
@@ -323,7 +323,7 @@ <h3>LoRA with tensor parallel<a class="headerlink" href="#lora-with-tensor-paral
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3045de0>
<jinja2.runtime.BlockReference object at 0x7f8a092eb700>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion advanced/weight-streaming.html
@@ -206,7 +206,7 @@ <h2>API Changes<a class="headerlink" href="#api-changes" title="Link to this hea
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3cc2cb0>
<jinja2.runtime.BlockReference object at 0x7f8a091051e0>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion architecture/add-model.html
@@ -240,7 +240,7 @@ <h2>Reference<a class="headerlink" href="#reference" title="Link to this heading
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3180c40>
<jinja2.runtime.BlockReference object at 0x7f8a08927550>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion architecture/checkpoint.html
@@ -506,7 +506,7 @@ <h2>Make Evaluation<a class="headerlink" href="#make-evaluation" title="Link to
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3be1150>
<jinja2.runtime.BlockReference object at 0x7f8a09121720>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion architecture/core-concepts.html
@@ -377,7 +377,7 @@ <h1>Runtime<a class="headerlink" href="#runtime" title="Link to this heading">
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3044bb0>
<jinja2.runtime.BlockReference object at 0x7f8a09397220>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion architecture/overview.html
@@ -158,7 +158,7 @@ <h2>Model Weights<a class="headerlink" href="#model-weights" title="Link to this
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3250520>
<jinja2.runtime.BlockReference object at 0x7f8a08941990>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion architecture/workflow.html
@@ -336,7 +336,7 @@ <h2>CLI Tools<a class="headerlink" href="#cli-tools" title="Link to this heading
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3c0a6b0>
<jinja2.runtime.BlockReference object at 0x7f8a089629e0>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion blogs/Falcon180B-H200.html
@@ -295,7 +295,7 @@ <h3>Closing<a class="headerlink" href="#closing" title="Link to this heading">
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3c09720>
<jinja2.runtime.BlockReference object at 0x7f8a093abd00>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion blogs/H100vsA100.html
@@ -247,7 +247,7 @@ <h2>What is H100 FP8?<a class="headerlink" href="#what-is-h100-fp8" title="Link
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3d17310>
<jinja2.runtime.BlockReference object at 0x7f8a08ce38e0>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion blogs/H200launch.html
@@ -239,7 +239,7 @@ <h2>Latest HBM Memory<a class="headerlink" href="#latest-hbm-memory" title="Link
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf2d1e740>
<jinja2.runtime.BlockReference object at 0x7f8a08d0ca90>

<div class="footer">
<p>
4 changes: 2 additions & 2 deletions blogs/XQA-kernel.html
@@ -141,7 +141,7 @@ <h1>New XQA-kernel provides 2.4x more Llama-70B throughput within the same laten
<p>Support matrix and usage flags are described in <a class="reference internal" href="#/docs/source/advanced/gpt-attention.md#xqa-optimization"><span class="xref myst">docs/source/advanced/gpt_attention</span></a>.</p>
<p><strong>Increased Throughput:</strong>
Looking at the Throughput-Latency curves below, we see that the enabling of XQA optimization increases throughput. Higher throughput equates to serving more users, and we can see that TPOT on the Y-axis flattens out when XQA gets enabled.</p>
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/media/XQA_ThroughputvsLatency.png" alt="XQA increased throughput within same latency budget" width="950" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/XQA_ThroughputvsLatency.png?raw=true" alt="XQA increased throughput within same latency budget" width="950" height="auto">
<p><sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a</sub></p>
<section id="llama-70b-on-h200-up-to-2-4x-increased-throughput-with-xqa-within-same-latency-budget">
<h2>Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget<a class="headerlink" href="#llama-70b-on-h200-up-to-2-4x-increased-throughput-with-xqa-within-same-latency-budget" title="Link to this heading"></a></h2>
@@ -204,7 +204,7 @@ <h3>Closing<a class="headerlink" href="#closing" title="Link to this heading">
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3d169e0>
<jinja2.runtime.BlockReference object at 0x7f8a08e203a0>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion blogs/quantization-in-TRT-LLM.html
@@ -359,7 +359,7 @@ <h2>What’s coming next<a class="headerlink" href="#whats-coming-next" title="L
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3c09b70>
<jinja2.runtime.BlockReference object at 0x7f8a08d688b0>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion executor.html
@@ -190,7 +190,7 @@ <h2>Python Bindings for the Executor API<a class="headerlink" href="#python-bind
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3d174f0>
<jinja2.runtime.BlockReference object at 0x7f8a04c7b640>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion genindex.html
@@ -3773,7 +3773,7 @@ <h2 id="T">T</h2>
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf87c1ba0>
<jinja2.runtime.BlockReference object at 0x7f8a08c552d0>

<div class="footer">
<p>
2 changes: 1 addition & 1 deletion index.html
@@ -364,7 +364,7 @@ <h1>Indices and tables<a class="headerlink" href="#indices-and-tables" title="Li
<hr/>

<div role="contentinfo">
<jinja2.runtime.BlockReference object at 0x7fedf3c08ee0>
<jinja2.runtime.BlockReference object at 0x7f8a08e59ae0>

<div class="footer">
<p>

0 comments on commit 10588d0
