Adds write-coalescing code path optimization to FST (#16143)
This PR adds an optimized code path to the finite-state transducer (FST) that uses a shared-memory-backed write buffer for the translated output and the translated output indexes, provided the write buffer does not require an excessive amount of shared memory (the current heuristic caps it at 24 KB per CTA). Writes are first buffered in shared memory and then cooperatively written out to global memory using coalesced writes.
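
The two ideas in the description (a shared-memory budget check, then staging followed by a coalesced flush) can be illustrated with a minimal host-side sketch. Only the 24 KB per-CTA budget comes from the PR description. The function names, parameters, and the sequential simulation of a CTA are all hypothetical and are not libcudf's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Budget from the PR description: at most 24 KB of shared memory per CTA.
constexpr std::size_t smem_write_buffer_budget = 24 * 1024;

// Hypothetical helper: bytes needed to stage one CTA's translated output
// symbols and output indexes before they are flushed to global memory.
constexpr std::size_t write_buffer_bytes(std::size_t threads_per_cta,
                                         std::size_t symbols_per_thread,
                                         std::size_t max_out_per_symbol,
                                         std::size_t out_symbol_size,
                                         std::size_t out_index_size)
{
  return threads_per_cta * symbols_per_thread * max_out_per_symbol *
         (out_symbol_size + out_index_size);
}

// Hypothetical heuristic: take the write-coalescing path only if the buffer fits.
constexpr bool use_write_coalescing(std::size_t threads_per_cta,
                                    std::size_t symbols_per_thread,
                                    std::size_t max_out_per_symbol,
                                    std::size_t out_symbol_size,
                                    std::size_t out_index_size)
{
  return write_buffer_bytes(threads_per_cta, symbols_per_thread, max_out_per_symbol,
                            out_symbol_size, out_index_size) <= smem_write_buffer_budget;
}

// Sequential simulation of one CTA. per_thread_out[t] holds what thread t produced.
std::vector<char> stage_and_flush(std::vector<std::vector<char>> const& per_thread_out)
{
  std::size_t const num_threads = per_thread_out.size();

  // Exclusive prefix sum over the per-thread output counts yields each
  // thread's write offset into the staging buffer.
  std::vector<std::size_t> offsets(num_threads + 1, 0);
  for (std::size_t t = 0; t < num_threads; ++t) {
    offsets[t + 1] = offsets[t] + per_thread_out[t].size();
  }

  // Step 1: each thread writes its outputs into the simulated shared-memory buffer.
  std::vector<char> smem_buffer(offsets[num_threads]);
  for (std::size_t t = 0; t < num_threads; ++t) {
    assert(offsets[t] + per_thread_out[t].size() <= smem_buffer.size());
    std::copy(per_thread_out[t].begin(), per_thread_out[t].end(),
              smem_buffer.begin() + offsets[t]);
  }

  // Step 2: flush. In each round, thread t writes element base + t, so
  // neighboring threads touch neighboring addresses (a coalesced pattern).
  std::vector<char> gmem(smem_buffer.size());
  for (std::size_t base = 0; base < smem_buffer.size(); base += num_threads) {
    for (std::size_t t = 0; t < num_threads && base + t < smem_buffer.size(); ++t) {
      gmem[base + t] = smem_buffer[base + t];
    }
  }
  return gmem;
}
```

With an assumed configuration of 128 threads, 16 symbols per thread, 1-byte output symbols, and 4-byte output indexes, at most one output per symbol needs 10 KB and fits the budget, whereas up to three outputs per symbol would need 30 KB and would fall back to the non-buffered path.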

## Benchmark results

Numbers are for libcudf's FST_NVBENCH with a 1.073 GB input. The FST outputs one token per input symbol. Benchmarks were run on a V100 with 900 GB/s theoretical peak bandwidth.
We compare the current FST implementation (old) to an FST implementation that uses write-coalescing to global memory (new).

| benchmark        | OLD throughput (GB/s) | NEW throughput (GB/s) | relative performance (NEW/OLD) | 1st kernel: bytes read/written per input byte | 2nd kernel: bytes read/written per input byte | expected SOL (GB/s) | achieved SOL (old) | achieved SOL (new) |
|------------------|-----------------------|-----------------------|--------------------------------|-----------------------------------------------|-----------------------------------------------|---------------------|--------------------|--------------------|
| full             |                  15.7 |                 74.74 |                           476% |                                             1 |                                             6 |              102.86 |             15.26% |             72.66% |
| no out-indexes   |                39.123 |                 105.8 |                           270% |                                             1 |                                             2 |              240.00 |             16.30% |             44.08% |
| no-output        |                229.27 |                178.92 |                            78% |                                             1 |                                             1 |              360.00 |             63.69% |             49.70% |
| out-indexes-only |                 24.95 |                  85.2 |                           341% |                                             1 |                                             5 |              120.00 |             20.79% |             71.00% |
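
The derived columns follow from simple arithmetic, sketched below. Relative performance is NEW throughput divided by OLD throughput, and achieved SOL is throughput divided by expected SOL. The expected SOL values are consistent with dividing 720 GB/s by the total bytes moved per input byte across both kernels. That 720 GB/s figure (80% of the 900 GB/s peak) is inferred from the table rather than stated in the PR, and the function names are hypothetical.

```cpp
#include <cassert>

// Assumption inferred from the table (e.g. 102.86 GB/s * 7 bytes = 720 GB/s);
// the PR itself only states the 900 GB/s theoretical peak.
constexpr double assumed_achievable_bw_gbps = 720.0;

// Expected speed-of-light throughput given the bytes read/written per input
// byte by the first and second kernel.
double expected_sol_gbps(double bytes_kernel1, double bytes_kernel2)
{
  assert(bytes_kernel1 + bytes_kernel2 > 0.0);
  return assumed_achievable_bw_gbps / (bytes_kernel1 + bytes_kernel2);
}

// Fraction of the expected SOL that a measured throughput reaches, in percent.
double achieved_sol_percent(double throughput_gbps, double expected_gbps)
{
  assert(expected_gbps > 0.0);
  return 100.0 * throughput_gbps / expected_gbps;
}

// NEW throughput relative to OLD throughput, in percent.
double relative_performance_percent(double old_gbps, double new_gbps)
{
  assert(old_gbps > 0.0);
  return 100.0 * new_gbps / old_gbps;
}
```

For the "full" row, `expected_sol_gbps(1, 6)` gives about 102.86 GB/s, and 74.74 GB/s against that is about 72.66%, matching the table.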

Authors:
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #16143
elstehle authored Jul 23, 2024
1 parent ff30c02 commit cd71191
Showing 8 changed files with 425 additions and 98 deletions.
16 changes: 12 additions & 4 deletions cpp/benchmarks/io/fst.cu
@@ -95,7 +95,9 @@ void BM_FST_JSON(nvbench::state& state)
   auto parser = cudf::io::fst::detail::make_fst(
     cudf::io::fst::detail::make_symbol_group_lut(pda_sgs),
     cudf::io::fst::detail::make_transition_table(pda_state_tt),
-    cudf::io::fst::detail::make_translation_table<max_translation_table_size>(pda_out_tt),
+    cudf::io::fst::detail::make_translation_table<max_translation_table_size,
+                                                  min_translated_out,
+                                                  max_translated_out>(pda_out_tt),
     stream);

   state.set_cuda_stream(nvbench::make_cuda_stream_view(stream.value()));
@@ -134,7 +136,9 @@ void BM_FST_JSON_no_outidx(nvbench::state& state)
   auto parser = cudf::io::fst::detail::make_fst(
     cudf::io::fst::detail::make_symbol_group_lut(pda_sgs),
     cudf::io::fst::detail::make_transition_table(pda_state_tt),
-    cudf::io::fst::detail::make_translation_table<max_translation_table_size>(pda_out_tt),
+    cudf::io::fst::detail::make_translation_table<max_translation_table_size,
+                                                  min_translated_out,
+                                                  max_translated_out>(pda_out_tt),
     stream);

   state.set_cuda_stream(nvbench::make_cuda_stream_view(stream.value()));
@@ -171,7 +175,9 @@ void BM_FST_JSON_no_out(nvbench::state& state)
   auto parser = cudf::io::fst::detail::make_fst(
     cudf::io::fst::detail::make_symbol_group_lut(pda_sgs),
     cudf::io::fst::detail::make_transition_table(pda_state_tt),
-    cudf::io::fst::detail::make_translation_table<max_translation_table_size>(pda_out_tt),
+    cudf::io::fst::detail::make_translation_table<max_translation_table_size,
+                                                  min_translated_out,
+                                                  max_translated_out>(pda_out_tt),
     stream);

   state.set_cuda_stream(nvbench::make_cuda_stream_view(stream.value()));
@@ -209,7 +215,9 @@ void BM_FST_JSON_no_str(nvbench::state& state)
   auto parser = cudf::io::fst::detail::make_fst(
     cudf::io::fst::detail::make_symbol_group_lut(pda_sgs),
     cudf::io::fst::detail::make_transition_table(pda_state_tt),
-    cudf::io::fst::detail::make_translation_table<max_translation_table_size>(pda_out_tt),
+    cudf::io::fst::detail::make_translation_table<max_translation_table_size,
+                                                  min_translated_out,
+                                                  max_translated_out>(pda_out_tt),
     stream);

   state.set_cuda_stream(nvbench::make_cuda_stream_view(stream.value()));