Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] parquet reader::impl::decode_page_data error_code checking slowness #15122

Closed
abellina opened this issue Feb 22, 2024 · 7 comments
Closed
Labels
bug Something isn't working

Comments

@abellina
Copy link
Contributor

abellina commented Feb 22, 2024

While benchmarking some code we found that about 5% worth of time are being lost due to this line of code:
https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L248

This is on our perf cluster (A100) for NDS @3k. It explains some of a dip in perf we have seen since 23.10 but we haven't gotten around to testing.

If I stop obtaining the value from error_code (e.g. I don't perform the pageable memcpy essentially) we gain 20 seconds locally. I am filing this because it may be a good idea to remove this or look into how to improve it (would a pinned copy help)?

@abellina abellina added the bug Something isn't working label Feb 22, 2024
@abellina
Copy link
Contributor Author

@nvdbaranec @vuule fyi

@vuule
Copy link
Contributor

vuule commented Feb 22, 2024

That's wild. Have you maybe looked at a profile to see how we spend the additional time?

@abellina
Copy link
Contributor Author

it's over many queries, I think it is due to the pageable copy (synchronization due to staging copies). I wonder if a quick prototype would be to swap the pageable copy by a pinned one and see if we recover the amount of time lost.

@vuule
Copy link
Contributor

vuule commented Feb 23, 2024

I can prototype something tomorrow. My concern is that we'll break even without the pinned pool due to the extra pinned allocation.

@abellina
Copy link
Contributor Author

this would imply having @nvdbaranec's pinned pool #15079, so the allocation should be virtually free :)

@vuule
Copy link
Contributor

vuule commented Feb 23, 2024

yeah but only spark will have the pool for now ;)

@GregoryKimball
Copy link
Contributor

Thank you @abellina for posting this. Please also consult #14167 and #14415 for a similar error code check and performance fix. Would you please investigate a single-GPU single-node reproducer for this performance issue? Do you think a single-GPU reproducer is possible?

rapids-bot bot pushed a commit that referenced this issue Mar 5, 2024
…15140)

Issue #15122

The addition of kernel error checking introduced a 5% performance regression in Spark-RAPIDS. It was determined that the pageable copy of the error back to host caused this overhead, presumably because of the CUDA's bounce buffer bottleneck.

This PR aims to eliminate most of the error checking overhead by using `hostdevice_vector` in the `kernel_error` class. The `hostdevice_vector` uses pinned memory so the copy is no longer pageable. The PR also removes the redundant sync after we read the error.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Paul Mattione (https://github.com/pmattione-nvidia)

URL: #15140
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants