[BUG] libcudf chunked Parquet reader hangs when loading Parquet with older array encoding #13239

Closed
jlowe opened this issue Apr 27, 2023 · 3 comments · Fixed by #13277
Labels
bug (Something isn't working); cuIO (cuIO issue); libcudf (Affects libcudf (C++/CUDA) code.); Spark (Functionality that helps Spark RAPIDS)

Comments

jlowe commented Apr 27, 2023

Describe the bug
Potentially related to #13237. Loading the Parquet test file from #13237 with the chunked Parquet reader hangs from the caller's perspective: the call never returns and consumes 100% of a CPU core.

Steps/Code to reproduce bug
Using the same test file from #13237, the following C++ program will reproduce the hang.

#include <iostream>

#include <cudf/io/parquet.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/types.hpp>

int main(int argc, char** argv) {
  std::cerr << "Creating reader" << std::endl;
  // Chunked reader with a chunk read limit of 1L << 31 bytes (2 GiB).
  auto reader = cudf::io::chunked_parquet_reader(1L << 31,
    cudf::io::parquet_reader_options::builder(cudf::io::source_info("pq891392009.parquet")));
  // Drain all chunks. With the test file, the program never reaches "Done reading".
  while (reader.has_next()) {
    std::cerr << "Reading batch" << std::endl;
    auto chunk = reader.read_chunk();
  }
  std::cerr << "Done reading" << std::endl;
  return 0;
}

Expected behavior
libcudf calls should not hang/crash on valid inputs.

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Apr 27, 2023

vuule commented Apr 28, 2023

I ran this test locally. Surprisingly, it does terminate after a few minutes, failing with a bad_alloc exception.

vuule commented Apr 29, 2023

Some more info:
The reader gets stuck in this loop
https://github.com/rapidsai/cudf/blob/branch-23.06/cpp/src/io/parquet/reader_impl_preprocess.cu#L1080
in find_splits.
cur_row_count stays at zero while num_rows is one, so it never exits the loop.
The bad_alloc is thrown because the loop keeps adding empty splits to, well, splits.

FWIW, we could detect that we're creating empty splits and throw instead of hanging until the heap is full.
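
To make the idea concrete, here is a rough sketch of what such a guard might look like. It is not the libcudf implementation: the types cumulative_row_info and chunk_read_info, the function name find_splits_checked, and the simplified splitting logic are all stand-ins made up for illustration. It assumes the input is sorted by row count and that both fields are cumulative.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT libcudf's actual types.
struct cumulative_row_info {
  std::size_t row_count;   // cumulative number of rows up to this entry
  std::size_t size_bytes;  // cumulative output size in bytes up to this entry
};

struct chunk_read_info {
  std::size_t skip_rows;  // first row of the chunk
  std::size_t num_rows;   // number of rows in the chunk
};

// Walks `sizes` (assumed sorted, with non-decreasing cumulative values) and
// emits chunks that stay within `chunk_read_limit` where possible.
// Throws std::logic_error if an iteration would produce an empty split.
std::vector<chunk_read_info> find_splits_checked(std::vector<cumulative_row_info> const& sizes,
                                                 std::size_t num_rows,
                                                 std::size_t chunk_read_limit)
{
  std::vector<chunk_read_info> splits;
  std::size_t cur_pos       = 0;
  std::size_t cur_size      = 0;
  std::size_t cur_row_count = 0;

  while (cur_row_count < num_rows) {
    std::size_t pos            = cur_pos;
    std::size_t next_row_count = cur_row_count;
    std::size_t next_size      = cur_size;

    // Take every entry that still fits within the limit.
    while (pos < sizes.size() && sizes[pos].size_bytes - cur_size <= chunk_read_limit) {
      next_row_count = sizes[pos].row_count;
      next_size      = sizes[pos].size_bytes;
      ++pos;
    }

    // Best effort: if nothing added rows, accept later entries even though
    // the resulting chunk exceeds the limit.
    while (next_row_count == cur_row_count && pos < sizes.size()) {
      next_row_count = sizes[pos].row_count;
      next_size      = sizes[pos].size_bytes;
      ++pos;
    }

    // The guard: without progress the loop would append empty splits forever.
    if (next_row_count <= cur_row_count) {
      throw std::logic_error("find_splits_checked: empty split, row count did not advance");
    }

    splits.push_back(chunk_read_info{cur_row_count, next_row_count - cur_row_count});
    cur_row_count = next_row_count;
    cur_size      = next_size;
    cur_pos       = pos;
  }
  return splits;
}
```

With an input like the one described above, where the only entry reports zero rows while num_rows is one, this sketch throws on the first iteration instead of growing the output until memory runs out.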

CC @nvdbaranec who might better understand what is going wrong in the reader here.

nvdbaranec commented

This appears to be a misinterpretation of the schema due to an overzealous is_one_level_list() check. @hyperbolic2346 is looking at a fix.

@rapids-bot rapids-bot bot closed this as completed in 0a5065f May 8, 2023
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024