
[FEA] Improve JSON reader performance with "document" strings columns #13724

Closed

GregoryKimball opened this issue Jul 19, 2023 · 0 comments · Fixed by #13803
Labels: 0 - Backlog (In queue waiting for assignment), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Performance (Performance related issue)


GregoryKimball commented Jul 19, 2023

In the "Common Crawl" document dataset, the size of "documents" (i.e. large strings) varies from 200 characters to 1M characters and typically each file contains a few thousand "documents". In libcudf JSON reading, the get_token_stream step is data-parallel and efficiently processes the strings data of these documents. However in the device_json_column_to_cudf_column and parse_data steps, we observe warp divergence and longer runtimes for variable document sizes. Depending on the row count, this divergence results in 20% to 6x slower end-to-end read time for variable length strings compared to constant length strings.

Here is a code snippet to demonstrate the impact of variable document sizes.

import time
from io import BytesIO

import cudf
import cupy
import nvtx

cupy.random.seed(0)

# Variable-length strings: lengths drawn from the lognormal model of
# "Common Crawl" document lengths.
num_rows = 6000
df = cudf.DataFrame({'a': ['a'] * num_rows})
df['b'] = df['a'].str.repeat(cupy.random.lognormal(7.6, 1.2, num_rows).astype(int))
average_length = df['b'].str.len().sum() // num_rows
variable_lengths_buffer = BytesIO()
df.to_json(variable_lengths_buffer, engine='cudf', lines=True)

# Constant-length strings with the same average length, for comparison.
df = cudf.DataFrame({'a': ['a'] * num_rows})
df['b'] = df['a'].str.repeat(average_length)
constant_lengths_buffer = BytesIO()
df.to_json(constant_lengths_buffer, engine='cudf', lines=True)

variable_lengths_buffer.seek(0)  # rewind before reading
with nvtx.annotate('variable lengths', color="yellow"):
    t0 = time.time()
    df = cudf.read_json(variable_lengths_buffer, lines=True)
    t1 = time.time()
print('variable lengths', t1 - t0)

constant_lengths_buffer.seek(0)  # rewind before reading
with nvtx.annotate('constant lengths', color="yellow"):
    t0 = time.time()
    df = cudf.read_json(constant_lengths_buffer, lines=True)
    t1 = time.time()
print('constant lengths', t1 - t0)
Output:

variable lengths 0.12197709083557129
constant lengths 0.02486705780029297

That is roughly a 4.9x slower read for the variable-length strings.

The end-to-end impact of the warp divergence seems to be reduced when reading large numbers of documents (100K+), which explains why multi-source reads of document data have shown better performance.
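As an illustration (not from the original report), a multi-source read can be sketched by serializing the same variable-length data into several buffers and passing the list to cudf.read_json, so the divergent parsing kernels are amortized over a much larger row count; the source count and row counts below are arbitrary:

from io import BytesIO

import cudf
import cupy

cupy.random.seed(0)

# Build one variable-length source, then replicate it across several buffers.
rows_per_source, num_sources = 6000, 20
src = cudf.DataFrame({'a': ['a'] * rows_per_source})
src['b'] = src['a'].str.repeat(
    cupy.random.lognormal(7.6, 1.2, rows_per_source).astype(int))

sources = []
for _ in range(num_sources):
    buf = BytesIO()
    src.to_json(buf, engine='cudf', lines=True)
    buf.seek(0)  # rewind so the reader starts at the beginning
    sources.append(buf)

# cudf.read_json accepts a list of sources for JSON Lines input;
# this single call parses 120K rows instead of 6K.
df_multi = cudf.read_json(sources, lines=True)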

Here's a profile of 100K num_rows, which shows the difference in runtime and warp occupancy in parse_data. At this higher row count, the difference in end-to-end runtime is also smaller.

[profile screenshot: parse_data runtime and warp occupancy at 100K rows]

Here's a profile of 6K num_rows, which shows a 6x increase in end-to-end reading time for variable length strings versus constant length strings.
[profile screenshot: end-to-end read time at 6K rows]

Also, here is a snapshot of "Common Crawl" document lengths and the lognormal(7.6, 1.2) distribution used to model it.
[chart: "Common Crawl" document lengths vs. the lognormal(7.6, 1.2) model]
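For reference, the summary statistics implied by a lognormal(7.6, 1.2) length model follow directly from the standard lognormal formulas (this calculation is added here for context and is not from the original report):

import numpy as np

# Document-length statistics (in characters) for lognormal(mu=7.6, sigma=1.2).
mu, sigma = 7.6, 1.2
median = np.exp(mu)                  # exp(mu)              ~2,000 characters
mean = np.exp(mu + sigma ** 2 / 2)   # exp(mu + sigma^2/2)  ~4,100 characters
p99 = np.exp(mu + sigma * 2.326)     # 99th percentile      ~32,600 characters
print(f'median={median:.0f}, mean={mean:.0f}, p99={p99:.0f}')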

@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Performance Performance related issue labels Jul 19, 2023
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Jul 19, 2023
@karthikeyann karthikeyann self-assigned this Aug 9, 2023
rapids-bot bot pushed a commit that referenced this issue Sep 20, 2023
…3803)

closes #13724

In the old code, one thread per string is allocated to parse a string column. For longer strings (>1024 characters), one-thread-per-string decoding takes too long, even for a small number of strings.

With this change, one warp per string is used to parse strings of length <=1024, and one block per string for strings longer than 1024. If the maximum string length is <128, one thread per string is used as before.

Both kernels use 256 threads per block. The one-warp-per-string and one-block-per-string code is similar, varying only in the warp-wide versus block-wide primitives used for reduction and scan operations; shared memory usage also differs slightly.
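To make the kernel selection concrete, here is a small sketch of the thresholds described above (Python pseudocode; the function name is hypothetical and not part of the libcudf API):

def choose_parse_strategy(max_string_length: int) -> str:
    """Kernel-selection heuristic from the PR description (illustrative only)."""
    if max_string_length < 128:
        return 'one-thread-per-string'  # short strings: thread-parallel, as before
    if max_string_length <= 1024:
        return 'one-warp-per-string'    # warp-wide reduction/scan primitives
    return 'one-block-per-string'       # block-wide primitives for long strings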

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13803
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023