Skip to content

Commit

Permalink
Long string optimization for string column parsing in JSON reader (#1…
Browse files Browse the repository at this point in the history
…3803)

closes #13724

In old code, 1 thread per string is allocated for parsing a string column.
For longer strings (>1024), the runtime of 1-thread-per-string to decode is taking too long even for few strings.

In this change, 1 warp per string is used for parsing for strings length <=1024 and 1 block per string for string length >1024. If max string length < 128, 1 thread per string is used as usual.

256 threads_per_block is used for both kernels.
Code for 1-warp-per-string and 1-block-per-string is similar, but only varies with warp-wide and block-wide primitives for reduction and scan operations. shared memory usage will differ slightly too.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13803
  • Loading branch information
karthikeyann authored Sep 20, 2023
1 parent c016b58 commit 97501d8
Show file tree
Hide file tree
Showing 12 changed files with 1,395 additions and 587 deletions.
2 changes: 2 additions & 0 deletions cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -413,11 +413,13 @@ add_library(
src/io/utilities/arrow_io_source.cpp
src/io/utilities/column_buffer.cpp
src/io/utilities/config_utils.cpp
src/io/utilities/data_casting.cu
src/io/utilities/data_sink.cpp
src/io/utilities/datasource.cpp
src/io/utilities/file_io_utilities.cpp
src/io/utilities/parsing_utils.cu
src/io/utilities/row_selection.cpp
src/io/utilities/type_inference.cu
src/io/utilities/trie.cu
src/jit/cache.cpp
src/jit/parser.cpp
Expand Down
Loading

0 comments on commit 97501d8

Please sign in to comment.