
[FEA] Improve JSON reader performance with "document" strings columns #13724

Closed

GregoryKimball opened this issue Jul 19, 2023 · 0 comments · Fixed by #13803
Labels: 0 - Backlog (In queue waiting for assignment), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Performance (Performance related issue)


GregoryKimball commented Jul 19, 2023

In the "Common Crawl" document dataset, the size of "documents" (i.e. large strings) varies from 200 characters to 1M characters and typically each file contains a few thousand "documents". In libcudf JSON reading, the get_token_stream step is data-parallel and efficiently processes the strings data of these documents. However in the device_json_column_to_cudf_column and parse_data steps, we observe warp divergence and longer runtimes for variable document sizes. Depending on the row count, this divergence results in 20% to 6x slower end-to-end read time for variable length strings compared to constant length strings.

Here is a code snippet to demonstrate the impact of variable document sizes.

import time
from io import BytesIO

import cudf
import cupy
import nvtx

cupy.random.seed(0)

# Variable-length strings: lengths drawn from the lognormal model of
# "Common Crawl" document lengths.
num_rows = 6000
df = cudf.DataFrame({'a': ['a'] * num_rows})
df['b'] = df['a'].str.repeat(cupy.random.lognormal(7.6, 1.2, num_rows).astype(int))
average_length = df['b'].str.len().sum() // num_rows
variable_lengths_buffer = BytesIO()
df.to_json(variable_lengths_buffer, engine='cudf', lines=True)

# Constant-length strings with the same average length, for comparison.
df = cudf.DataFrame({'a': ['a'] * num_rows})
df['b'] = df['a'].str.repeat(average_length)
constant_lengths_buffer = BytesIO()
df.to_json(constant_lengths_buffer, engine='cudf', lines=True)

variable_lengths_buffer.seek(0)  # rewind before reading
with nvtx.annotate('variable lengths', color="yellow"):
    t0 = time.time()
    df = cudf.read_json(variable_lengths_buffer, lines=True)
    t1 = time.time()
print('variable lengths', t1 - t0)

constant_lengths_buffer.seek(0)  # rewind before reading
with nvtx.annotate('constant lengths', color="yellow"):
    t0 = time.time()
    df = cudf.read_json(constant_lengths_buffer, lines=True)
    t1 = time.time()
print('constant lengths', t1 - t0)
Output:

variable lengths 0.12197709083557129
constant lengths 0.02486705780029297

That is roughly a 4.9x slower read for the variable-length strings.

The end-to-end impact of the warp divergence seems to be reduced when reading large numbers of documents (100K+), which explains why multi-source reads of document data have shown better performance.
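As an illustration (not from the original report), a multi-source read can be sketched by serializing the same variable-length data into several buffers and passing the list to cudf.read_json, so the divergent parsing kernels are amortized over a much larger row count; the source count and row counts below are arbitrary:

from io import BytesIO

import cudf
import cupy

cupy.random.seed(0)

# Build one variable-length source, then replicate it across several buffers.
rows_per_source, num_sources = 6000, 20
src = cudf.DataFrame({'a': ['a'] * rows_per_source})
src['b'] = src['a'].str.repeat(
    cupy.random.lognormal(7.6, 1.2, rows_per_source).astype(int))

sources = []
for _ in range(num_sources):
    buf = BytesIO()
    src.to_json(buf, engine='cudf', lines=True)
    buf.seek(0)  # rewind so the reader starts at the beginning
    sources.append(buf)

# cudf.read_json accepts a list of sources for JSON Lines input;
# this single call parses 120K rows instead of 6K.
df_multi = cudf.read_json(sources, lines=True)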

Here's a profile of 100K num_rows, which shows the difference in runtime and warp occupancy in parse_data. At this higher row count, the difference in end-to-end runtime is also smaller.

[profile screenshot: parse_data runtime and warp occupancy at 100K rows]

Here's a profile of 6K num_rows, which shows a 6x increase in end-to-end reading time for variable length strings versus constant length strings.
[profile screenshot: end-to-end read time at 6K rows]

Also, here is a snapshot of "Common Crawl" document lengths and the lognormal(7.6, 1.2) distribution used to model it.
[chart: "Common Crawl" document lengths vs. the lognormal(7.6, 1.2) model]
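For reference, the summary statistics implied by a lognormal(7.6, 1.2) length model follow directly from the standard lognormal formulas (this calculation is added here for context and is not from the original report):

import numpy as np

# Document-length statistics (in characters) for lognormal(mu=7.6, sigma=1.2).
mu, sigma = 7.6, 1.2
median = np.exp(mu)                  # exp(mu)              ~2,000 characters
mean = np.exp(mu + sigma ** 2 / 2)   # exp(mu + sigma^2/2)  ~4,100 characters
p99 = np.exp(mu + sigma * 2.326)     # 99th percentile      ~32,600 characters
print(f'median={median:.0f}, mean={mean:.0f}, p99={p99:.0f}')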

@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Performance Performance related issue labels Jul 19, 2023
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Jul 19, 2023
@karthikeyann karthikeyann self-assigned this Aug 9, 2023
rapids-bot bot pushed a commit that referenced this issue Sep 20, 2023
…3803)

closes #13724

In the old code, one thread per string is allocated to parse a string column. For longer strings (>1024 characters), one-thread-per-string decoding takes too long, even for a small number of strings.

With this change, one warp per string is used to parse strings of length <=1024, and one block per string for strings longer than 1024. If the maximum string length is <128, one thread per string is used as before.

Both kernels use 256 threads per block. The one-warp-per-string and one-block-per-string code is similar, varying only in the warp-wide versus block-wide primitives used for reduction and scan operations; shared memory usage also differs slightly.
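To make the kernel selection concrete, here is a small sketch of the thresholds described above (Python pseudocode; the function name is hypothetical and not part of the libcudf API):

def choose_parse_strategy(max_string_length: int) -> str:
    """Kernel-selection heuristic from the PR description (illustrative only)."""
    if max_string_length < 128:
        return 'one-thread-per-string'  # short strings: thread-parallel, as before
    if max_string_length <= 1024:
        return 'one-warp-per-string'    # warp-wide reduction/scan primitives
    return 'one-block-per-string'       # block-wide primitives for long strings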

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13803
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023