[FEA] Use bloom filters in Parquet reader to filter row groups with equality predicates #17164
Labels
cuco: cuCollections related issue
cuIO: cuIO issue
feature request: New feature or request
improvement: Improvement / enhancement to an existing function
libcudf: Affects libcudf (C++/CUDA) code.
mhaseeb123 added the cuco, cuIO, feature request, improvement, and libcudf labels on Oct 24, 2024
sleeepyjack pushed a commit to NVIDIA/cuCollections that referenced this issue on Oct 30, 2024
This PR adds a new Bloom Filter policy implementing the Arrow BF algorithm. This PR is a part of rapidsai/cudf#17164. A follow-up PR will add tests for bitwise validation of bloom filter using arrow policy.
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yunsong Wang
sleeepyjack pushed a commit to NVIDIA/cuCollections that referenced this issue on Nov 1, 2024
This PR adds tests to validate the bitset produced by inserting specific keys into a `cuco::bloom_filter` with `cuco::arrow_filter_policy` against the one generated by inserting the same keys into the implementation in Arrow. Related to #625. Part of rapidsai/cudf#17164. Reference bitset generated with Arrow here: https://godbolt.org/z/ebdddezbP
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
This was referenced Dec 13, 2024
rapids-bot bot pushed a commit that referenced this issue on Dec 20, 2024
…st::tree` (#17587)
This PR simplifies the StatsAST expression transformer in Parquet reader's predicate pushdown using `ast::tree` from (#17156). This PR is a follow up to @bdice's comment at #17289 (comment). Similar changes for the `BloomfilterAST` expression converter have been incorporated in the PR #17289. Related to #17164
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)
URL: #17587
rapids-bot bot pushed a commit that referenced this issue on Jan 14, 2025
…s using them (#17289)
This PR adds support to read bloom filters from Parquet files and use them to filter row groups based on `col == literal` like predicate(s), if provided. Related to #17164
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)
- Bradley Dice (https://github.com/bdice)
URL: #17289
rapids-bot bot pushed a commit that referenced this issue on Jan 22, 2025
…ffers (#17758)
Related to #17164
Related to NVIDIA/cuCollections#660
This PR creates and uses a `rmm::mr::aligned_resource_adaptor` to allocate device buffers for bloom filter data in accordance with bloom filter [alignment requirements](https://github.com/NVIDIA/cuCollections/blob/e79787be2cb3de1b12e90d56355612e47395cce5/include/cuco/detail/bloom_filter/bloom_filter_impl.cuh#L359-L362). This PR also updates the `query_bloom_filter` function to use the new `bloom_filter_ref` constructors introduced in NVIDIA/cuCollections#660.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)
URL: #17758
Is your feature request related to a problem? Please describe.
In the Parquet reader, we can use `cuco::bloom_filter_ref` with a custom `cuco::bloom_filter_policy` to filter row groups when we have an equality predicate. This would allow us to potentially reduce I/O.

The custom `cuco::bloom_filter_policy` would need to implement Arrow's logic for generating the bit pattern and selecting a bloom filter block for a given key. It would also be used in the future to write our own bloom filters to Parquet on the writer's side.

Describe the solution you'd like
Use `cuco::bloom_filter` with a custom `cuco::bloom_filter_policy` to implement Arrow's BF logic in the Parquet reader to filter row groups.

Additional context
The 1:1 Arrow BF policy may be implemented directly in cuco or upstreamed later on from cudf for exposure to broader RAPIDS.
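
For context, the bit pattern and block selection that such a policy has to reproduce can be sketched on the host in plain C++. This is a minimal illustration of the split-block bloom filter layout described in the Parquet format specification (256-bit blocks of eight 32-bit words, with the salt constants listed there). It is not the cuco or Arrow code. The names `SplitBlockBloomFilter`, `make_mask`, and `block_index` are made up for this sketch, and the caller is assumed to supply the 64-bit hash of the key (Parquet specifies xxHash64 with seed 0, which is not implemented here).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each block is 256 bits: eight 32-bit words, with one bit set per word.
constexpr std::size_t kWordsPerBlock = 8;
using Block = std::array<std::uint32_t, kWordsPerBlock>;

// Salt constants from the Parquet bloom filter specification.
constexpr std::array<std::uint32_t, kWordsPerBlock> kSalt = {
    0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
    0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Bit pattern: the low 32 bits of the hash select one bit in each word.
inline Block make_mask(std::uint32_t key) {
  Block mask{};
  for (std::size_t i = 0; i < kWordsPerBlock; ++i) {
    // Unsigned multiply wraps; the top 5 bits give a bit index in [0, 31].
    std::uint32_t const bit = (key * kSalt[i]) >> 27;
    mask[i] = std::uint32_t{1} << bit;
  }
  return mask;
}

// Block selection: the high 32 bits of the hash are scaled into [0, num_blocks).
inline std::size_t block_index(std::uint64_t hash, std::size_t num_blocks) {
  return static_cast<std::size_t>(
      ((hash >> 32) * static_cast<std::uint64_t>(num_blocks)) >> 32);
}

// Hypothetical host-side container, for illustration only.
struct SplitBlockBloomFilter {
  std::vector<Block> blocks;  // value-initialized, so all bits start at zero

  explicit SplitBlockBloomFilter(std::size_t num_blocks) : blocks(num_blocks) {
    assert(num_blocks > 0);
  }

  // `hash` is assumed to be the 64-bit hash of the key.
  void insert(std::uint64_t hash) {
    Block& block = blocks[block_index(hash, blocks.size())];
    Block const mask = make_mask(static_cast<std::uint32_t>(hash));
    for (std::size_t i = 0; i < kWordsPerBlock; ++i) { block[i] |= mask[i]; }
  }

  // False means the key is definitely absent; true means it may be present.
  bool may_contain(std::uint64_t hash) const {
    Block const& block = blocks[block_index(hash, blocks.size())];
    Block const mask = make_mask(static_cast<std::uint32_t>(hash));
    for (std::size_t i = 0; i < kWordsPerBlock; ++i) {
      if ((block[i] & mask[i]) == 0) { return false; }
    }
    return true;
  }
};
```

Because a lookup touches a single 32-byte block, one key maps to one contiguous, aligned read, which is what makes this layout convenient to query on a GPU.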
Associated Subtasks
* `cuco::bloom_filter_policy` to mimic Arrow BF policy
* `cuco::bloom_filter` with the read BF bitset and policy in Parquet reader
  * check min/max stats and bloom filter simultaneously to prune column chunks
  * identify which columns have equality conditions
  * read the bloom filters only for the relevant column chunks
* ✅ NVIDIA/cuCollections#654 updates `arrow_filter_policy` to not rely on xxhash64's member types to be consistent with STL
* `table_with_metadata`
* `ast::tree`
* ✅ rapidsai/rapids-cmake#735 bumps cuco to include changes from NVIDIA/cuCollections#654
* `aligned_resource_adaptor` to allocate bloom filter buffers and use new `bloom_filter_ref` ctors
* ✅ NVIDIA/cuCollections#660 New cuco constructors that avoid `__trap` leading to seg fault (thanks @sleeepyjack and @PointKernel)
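
The row group pruning described in the subtasks (check min/max statistics and the bloom filter for a `col == literal` predicate) could look roughly like the host-side sketch below. It is a simplified illustration, not the libcudf implementation: `ColumnChunkInfo` and `surviving_row_groups` are hypothetical names, the column is assumed to hold 64-bit integers, and the bloom filter is modeled as an optional callback so that chunks without a filter are always kept.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical metadata for one column chunk of a row group (not a libcudf type).
struct ColumnChunkInfo {
  std::int64_t min_value;
  std::int64_t max_value;
  // Left empty when the file carries no bloom filter for this column chunk.
  std::function<bool(std::int64_t)> bloom_may_contain;
};

// Returns the indices of row groups that may satisfy `col == literal`.
inline std::vector<std::size_t> surviving_row_groups(
    std::vector<ColumnChunkInfo> const& chunks, std::int64_t literal) {
  std::vector<std::size_t> keep;
  for (std::size_t i = 0; i < chunks.size(); ++i) {
    ColumnChunkInfo const& chunk = chunks[i];
    assert(chunk.min_value <= chunk.max_value);
    // Step 1: min/max statistics. A literal outside [min, max] cannot match.
    if (literal < chunk.min_value || literal > chunk.max_value) { continue; }
    // Step 2: bloom filter, consulted only when one is present.
    // A negative answer is definitive; a positive answer may be a false positive.
    if (chunk.bloom_may_contain && !chunk.bloom_may_contain(literal)) { continue; }
    keep.push_back(i);
  }
  return keep;
}
```

Both checks can only rule a row group out, never confirm a match, so the rows of every surviving row group still have to be filtered after decoding.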