Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Adjust libcudf to use kvikIO for small host reads #17259

Closed
GregoryKimball opened this issue Nov 6, 2024 · 4 comments
Closed

[FEA] Adjust libcudf to use kvikIO for small host reads #17259

GregoryKimball opened this issue Nov 6, 2024 · 4 comments
Assignees
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Nov 6, 2024

Is your feature request related to a problem? Please describe.

For parquet columns that are highly compressible, the page size can be very small. For example, we observe 50-100 KB pages for the NDS-H column l_shipinstruct in lineitems with ZSTD compression. This column has a cardinality of 4, encodes well with run_length, and (may) show other compression opportunities as well.

>>> df['l_shipinstruct'].unique()
0    DELIVER IN PERSON
1     TAKE BACK RETURN
2                 NONE
3          COLLECT COD

When performing IO on these pages, libcudf falls back to a host read from pageable instead of a pinned read via kvikIO.

For an async memory resource, using host read from pageable is not that different than the kvikIO option. However, when using a managed memory pool resource, host read from pageable appears to be ~2x slower in the IO step.

Image

Describe the solution you'd like
We should consider adjusting libcudf to always use kvikIO instead.

I believe we should consider refactoring datasource to not use a read threshold, and also decouple the gds labels from dispatch to kvikIO for host reads.

static constexpr size_t _gds_read_preferred_threshold = 128 << 10;  // 128KB

Additional context

We've collected evidence that with a managed memory pool, the HtoD host copy from pageable to managed is slower than the copy from pageable to device, at least for these small pages.

@GregoryKimball GregoryKimball added cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Nov 6, 2024
@GregoryKimball
Copy link
Contributor Author

@brandon-b-miller @vuule thank you for the discussion today about the pageable copy difference with a managed pool MR.

@vuule
Copy link
Contributor

vuule commented Nov 6, 2024

Related issue #17228: use kvikIO for host reads.
Most likely not required to avoid the pageable copies in this case.

@vuule
Copy link
Contributor

vuule commented Nov 6, 2024

Opened #17260 with a potential fix.
Haven't evaluated the performance impact.

rapids-bot bot pushed a commit that referenced this issue Nov 12, 2024
…ed (#17260)

Issue #17259

Avoid checking `_gds_read_preferred_threshold` threshold when deciding whether `device_read`/device_write` is preferred to host IO + copy. The reasons are twofold:
1. KvikIO already has an internal threshold for GDS use so we don't need to check on our end as well.
2. Without actual GDS use, kvikIO uses a pinned bounce buffer to efficiently copy to/from the device.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Basit Ayantunde (https://github.com/lamarrr)

URL: #17260
@vyasr
Copy link
Contributor

vyasr commented Nov 13, 2024

Resolved by #17260 (@vuule @GregoryKimball please reopen if needed).

@vyasr vyasr closed this as completed Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: No status
Development

No branches or pull requests

4 participants