Adds write-coalescing code path optimization to FST (#16143)
This PR adds an optimized code path to the finite-state transducer (FST) that uses a shared-memory-backed write buffer for the translated output and the translated output indexes, provided the write buffer does not require an excessive amount of shared memory (the current heuristic threshold is 24 KB/CTA). Writes are first buffered in shared memory and then collaboratively written out to global memory (gmem) using coalesced writes.

## Benchmark results

Numbers are for libcudf's FST_NVBENCH with a 1.073 GB input. The FST outputs one token per input symbol. Benchmarks were run on a V100 with 900 GB/s theoretical peak bandwidth. We compare the current FST implementation (old) to an FST implementation that uses write-coalescing to gmem (new). SOL stands for speed of light.

| Configuration | Old throughput (GB/s) | New throughput (GB/s) | Relative performance (new/old) | 1st kernel: bytes read/written per input byte | 2nd kernel: bytes read/written per input byte | Expected SOL (GB/s) | Achieved SOL (old) | Achieved SOL (new) |
|------------------|---------|--------|------|---|---|--------|--------|--------|
| full | 15.7 | 74.74 | 476% | 1 | 6 | 102.86 | 15.26% | 72.66% |
| no out-indexes | 39.123 | 105.8 | 270% | 1 | 2 | 240.00 | 16.30% | 44.08% |
| no-output | 229.27 | 178.92 | 78% | 1 | 1 | 360.00 | 63.69% | 49.70% |
| out-indexes-only | 24.95 | 85.2 | 341% | 1 | 5 | 120.00 | 20.79% | 71.00% |

Authors:
- Elias Stehle (https://github.com/elstehle)

Approvers:
- Shruti Shivakumar (https://github.com/shrshi)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #16143
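The commit message does not spell out how the expected SOL column is derived. The figures are consistent with dividing an assumed sustained bandwidth of 720 GB/s (80% of the 900 GB/s peak; this factor is inferred from the table, not stated in the PR) by the total bytes moved per input byte across both kernels. The following is a minimal sketch of that arithmetic under this assumption; the function and constant names are hypothetical and are not part of libcudf.

```cpp
#include <cassert>

// ASSUMPTION: sustained bandwidth in GB/s. The value 720 (80% of the V100's
// 900 GB/s theoretical peak) is inferred from the table; the PR does not state it.
constexpr double kAssumedSustainedBwGBs = 720.0;

// Expected speed-of-light throughput (GB/s of input), given the bytes read or
// written per input byte by the first and second kernel.
double expected_sol_gbs(double kernel1_bytes_per_input_byte,
                        double kernel2_bytes_per_input_byte)
{
  double const total_bytes = kernel1_bytes_per_input_byte + kernel2_bytes_per_input_byte;
  assert(total_bytes > 0.0);
  return kAssumedSustainedBwGBs / total_bytes;
}

// Fraction of the expected SOL that a measured throughput reaches.
double achieved_sol_fraction(double throughput_gbs,
                             double kernel1_bytes_per_input_byte,
                             double kernel2_bytes_per_input_byte)
{
  return throughput_gbs /
         expected_sol_gbs(kernel1_bytes_per_input_byte, kernel2_bytes_per_input_byte);
}

// Relative performance of the new implementation over the old one (1.0 == equal).
double relative_performance(double old_throughput_gbs, double new_throughput_gbs)
{
  assert(old_throughput_gbs > 0.0);
  return new_throughput_gbs / old_throughput_gbs;
}
```

For the "full" configuration, `expected_sol_gbs(1, 6)` gives 720 / 7, which rounds to the 102.86 GB/s in the table, and `achieved_sol_fraction(74.74, 1, 6)` rounds to the reported 72.66%. The same formula also shows that the "no-output" configuration regresses from 63.69% to 49.70% of SOL, which matches its 78% relative performance.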