[FEA] Improve JSON reader performance with "document" strings columns #13724
Labels: 0 - Backlog, cuIO, feature request, libcudf, Performance
In the "Common Crawl" document dataset, the size of "documents" (i.e. large strings) varies from 200 characters to 1M characters and typically each file contains a few thousand "documents". In libcudf JSON reading, the
get_token_stream
step is data-parallel and efficiently processes the strings data of these documents. However in thedevice_json_column_to_cudf_column
andparse_data
steps, we observe warp divergence and longer runtimes for variable document sizes. Depending on the row count, this divergence results in 20% to 6x slower end-to-end read time for variable length strings compared to constant length strings.Here is a code snippet to demonstrate the impact of variable document sizes.
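A minimal sketch of such a benchmark (not the original snippet, which is not reproduced here): it generates a JSON Lines buffer with a single "document" string column, drawing lengths either from the `lognormal(7.6, 1.2)` model mentioned below or as a constant, and times `cudf.read_json` on each. The field name, helper function, and row count are illustrative.

```python
import io
import json
import time

import numpy as np
import cudf


def make_json_lines(num_rows: int, variable_lengths: bool) -> io.StringIO:
    """Build a JSON Lines buffer with one "document" string per row."""
    rng = np.random.default_rng(42)
    if variable_lengths:
        # Document lengths drawn from the lognormal(7.6, 1.2) model of Common Crawl.
        lengths = rng.lognormal(mean=7.6, sigma=1.2, size=num_rows).astype(int)
        lengths = np.clip(lengths, 200, 1_000_000)
    else:
        # Constant-length documents with roughly the same mean size.
        lengths = np.full(num_rows, int(np.exp(7.6 + 1.2**2 / 2)))
    lines = (json.dumps({"document": "x" * int(n)}) for n in lengths)
    return io.StringIO("\n".join(lines))


for variable in (False, True):
    buf = make_json_lines(num_rows=6_000, variable_lengths=variable)
    start = time.perf_counter()
    df = cudf.read_json(buf, lines=True)
    elapsed = time.perf_counter() - start
    print(f"variable_lengths={variable}: {elapsed:.3f}s for {len(df)} rows")
```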
The end-to-end impact of the warp divergence seems to be reduced when reading large numbers of documents (100K+), which explains why multi-source reads of document data have shown better performance.
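For reference, a sketch of what such a multi-source read might look like, assuming `cudf.read_json` accepts a list of JSON Lines sources with the GPU engine; the file paths are hypothetical.

```python
import cudf

# Hypothetical Common Crawl shards, each containing a few thousand documents.
files = [f"common_crawl_part_{i}.jsonl" for i in range(8)]

# Reading multiple sources in one call keeps the total row count high (100K+),
# which the observation above suggests reduces the end-to-end divergence cost.
df = cudf.read_json(files, lines=True, engine="cudf")
print(len(df))
```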
Here's a profile of 100K `num_rows`, which shows the difference in runtime and warp occupancy in `parse_data`. This higher row count also shows less difference in end-to-end runtime.

Here's a profile of 6K `num_rows`, which shows a 6x increase in end-to-end reading time for variable-length strings versus constant-length strings.

Also, here is a snapshot of "Common Crawl" document lengths and the `lognormal(7.6, 1.2)` distribution used to model it.
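For context, a small sketch that samples the `lognormal(7.6, 1.2)` model to show how spread out the document lengths it implies are; the percentile choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
lengths = rng.lognormal(mean=7.6, sigma=1.2, size=1_000_000)

# Percentiles of the modeled document length (in characters).
for q in (1, 50, 99, 99.9):
    print(f"p{q}: {np.percentile(lengths, q):,.0f}")
# The median sits near exp(7.6) ~ 2,000 characters, while the long upper tail
# produces the much larger documents seen in the Common Crawl snapshot.
```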