[BUG] Cannot read non-annotated fixed_len_byte_array data from Parquet as binary (string) #13304

Closed
NVnavkumar opened this issue May 5, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@NVnavkumar
Contributor

Describe the bug
Parquet supports the FIXED_LEN_BYTE_ARRAY physical type in addition to the BINARY physical type (which is essentially a variable-length byte array). cuDF can currently read decimals and other annotated types that are stored as FIXED_LEN_BYTE_ARRAY, but it cannot read non-annotated FIXED_LEN_BYTE_ARRAY data as binary (or string) data. Such columns should behave the same as BINARY columns.

Basically, this schema:

message table {
    required FIXED_LEN_BYTE_ARRAY(10) bin_test;
}

should be handled the same as:

message table {
    required BINARY bin_test;
}
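
As a rough illustration of why the two schemas can carry the same values, the sketch below decodes PLAIN-encoded value bytes for both physical types in plain Python. It is not cuDF's reader: the function names are hypothetical, the buffers are hand-built rather than read from a real file, and it assumes the Parquet PLAIN encoding (a 4-byte little-endian length prefix per BYTE_ARRAY value, which is what the schema calls BINARY, and no prefix for FIXED_LEN_BYTE_ARRAY, whose width comes from the schema).

```python
import struct
from typing import List


def decode_plain_byte_array(buf: bytes) -> List[bytes]:
    """Decode PLAIN-encoded BYTE_ARRAY (BINARY) values.

    Each value is a 4-byte little-endian length followed by that many bytes.
    """
    values = []
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + length])
        pos += length
    return values


def decode_plain_flba(buf: bytes, type_length: int) -> List[bytes]:
    """Decode PLAIN-encoded FIXED_LEN_BYTE_ARRAY values.

    Values are stored back to back; the width is taken from the schema,
    for example 5 for FIXED_LEN_BYTE_ARRAY(5).
    """
    if type_length <= 0 or len(buf) % type_length != 0:
        raise ValueError("buffer size is not a multiple of type_length")
    return [buf[i:i + type_length] for i in range(0, len(buf), type_length)]


# Hand-built example buffers (assumption: two values, "hello" and "there").
variable = struct.pack("<I", 5) + b"hello" + struct.pack("<I", 5) + b"there"
fixed = b"hello" + b"there"

binary_values = decode_plain_byte_array(variable)
flba_values = decode_plain_flba(fixed, 5)
```

Both decoders yield the same list of byte strings, so a reader that already exposes BINARY columns as binary or string data could, in principle, expose non-annotated FIXED_LEN_BYTE_ARRAY columns the same way.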

This is required for NVIDIA/spark-rapids#7449

Steps/Code to reproduce bug
flba_binary_parquet.zip

The attached parquet file has one row with the following schema:

message spark {
    required fixed_len_byte_array(5) a;
    required fixed_len_byte_array(5) b;
}

In cuDF Python:

>>> df = cudf.read_parquet("/tmp/flba_binary.parquet")
>>> df
  a b
0

Expected behavior

The output should be:

>>> df
       a      b
0  hello  there

Environment overview

  • Docker commands:
docker pull rapidsai/rapidsai-core-nightly:23.06-cuda11.8-runtime-ubuntu22.04-py3.10
docker run --gpus all --rm -it \
        --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
        -p 8888:8888 -p 8787:8787 -p 8786:8786 \
        rapidsai/rapidsai-core-nightly:23.06-cuda11.8-runtime-ubuntu22.04-py3.10

I ran ipython inside the Docker container.

@revans2
Contributor

revans2 commented May 8, 2023

I think this is a duplicate of #12590

@NVnavkumar
Contributor Author

Closing this as a duplicate of #12590.

@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024