[BUG] cudf should throw exception when reading parquet file with unsupported codec #10602

Closed
wbo4958 opened this issue Apr 6, 2022 · 0 comments · Fixed by #10610
Labels: bug (Something isn't working), cuIO (cuIO issue), Spark (Functionality that helps Spark RAPIDS)

wbo4958 commented Apr 6, 2022

Describe the bug
Cudf returns wrong values when reading a Parquet file compressed with the zstd codec in the latest 22.04 nightly build. The file is test.zstd.txt.

In other cases Cudf throws the following exception, which I did not reproduce: rmm-src/include/rmm/cuda_stream_view.hpp:81: cudaErrorIllegalAddress an illegal memory access was encountered.

Meta info of test.zstd.txt

creator:     parquet-mr version 1.12.1 (build 2a5c06c58fa987f85aa22170be14d927d5ff6e7d) 
extra:       org.apache.spark.version = 3.2.0 
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"integer","nullable":false,"metadata":{}}]} 

file schema: spark_schema 
--------------------------------------------------------------------------------
a:           REQUIRED INT32 R:0 D:0

row group 1: RC:3 TS:35 OFFSET:4 
--------------------------------------------------------------------------------
a:            INT32 ZSTD DO:0 FPO:4 SZ:44/35/0.80 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 1, max: 3, num_nulls: 0]

cudf read

In [1]: import cudf

In [2]: df = cudf.read_parquet("parquet-zstd")

In [3]: df.a
Out[3]: 
0    768
1    768
2      0
Name: a, dtype: int32

pandas read

In [14]: import pandas as pd

In [15]: df = pd.read_parquet("parquet-zstd")

In [16]: df.a
Out[16]: 
0    1
1    2
2    3
Name: a, dtype: int32

Expected behavior

Although Cudf does not support zstd compression, it should throw an exception such as "unsupported zstd codec" instead of returning wrong values or failing with cudaErrorIllegalAddress.
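On a build where the reader does not reject such files itself, a caller could inspect the Parquet footer and fail early. The following is a minimal sketch of that idea, not cuDF's own implementation. The helper names (`find_unsupported_chunks`, `iter_chunk_codecs`, `check_codecs`) are hypothetical, and the `ASSUMED_SUPPORTED_CODECS` set is an illustrative assumption that should be replaced with the codecs your cuDF version actually documents. The footer is read with pyarrow, which is assumed to be installed.

```python
# Assumption: illustrative set only; consult the cuDF docs for your version.
ASSUMED_SUPPORTED_CODECS = frozenset({"UNCOMPRESSED", "SNAPPY"})


def find_unsupported_chunks(chunk_codecs, supported=ASSUMED_SUPPORTED_CODECS):
    """Return the (row_group, column, codec) triples whose codec is not supported."""
    return [
        (rg, col, codec)
        for rg, col, codec in chunk_codecs
        if codec.upper() not in supported
    ]


def iter_chunk_codecs(path):
    """Yield (row_group_index, column_path, codec) for every column chunk in the footer."""
    # Imported lazily so the pure check above works without pyarrow.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            yield rg, chunk.path_in_schema, chunk.compression


def check_codecs(path, supported=ASSUMED_SUPPORTED_CODECS):
    """Raise ValueError if any column chunk in `path` uses an unsupported codec."""
    bad = find_unsupported_chunks(iter_chunk_codecs(path), supported)
    if bad:
        details = ", ".join(
            f"row group {rg}, column {col!r}: {codec}" for rg, col, codec in bad
        )
        raise ValueError(f"unsupported Parquet codec in {path}: {details}")
```

Calling `check_codecs("parquet-zstd")` before `cudf.read_parquet("parquet-zstd")` would then raise for a file like the one above, where the metadata dump shows column a stored with ZSTD. Checking every column chunk rather than only the first matters because each chunk in a Parquet file records its own codec.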

wbo4958 added the bug, Needs Triage, and cuIO labels Apr 6, 2022
jlowe added the Spark label Apr 6, 2022
rapids-bot bot pushed a commit that referenced this issue Apr 7, 2022
Closes #10602

This PR adds a compression type check for each chunk in the input file. 
Reader throws if an unsupported compression is used.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - https://github.com/brandon-b-miller
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #10610
bdice removed the Needs Triage label Mar 4, 2024