ClassCastException possible in DeltaByteArrayReader after PARQUET-2431 #3013

Closed
bwjoh opened this issue Sep 13, 2024 · 2 comments · Fixed by #3019


bwjoh commented Sep 13, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Noticed when upgrading from 1.13.1 to 1.14.1

java.lang.ClassCastException: class org.apache.parquet.column.values.dictionary.DictionaryValuesReader cannot be cast to class org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader (org.apache.parquet.column.values.dictionary.DictionaryValuesReader and org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader are in unnamed module of loader 'app')
	at org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader.setPreviousReader(DeltaByteArrayReader.java:92)
	at org.apache.parquet.column.impl.ColumnReaderBase.initDataReader(ColumnReaderBase.java:734)
	at org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:766)
	at org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:56)
	at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:695)
	at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:686)
	at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:232)
	at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:686)
	at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:660)
	at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:802)
	at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:427)

This appears to be due to PARQUET-2431 - https://github.com/apache/parquet-java/pull/1274/files#diff-362b7d44b24283c1bb1f6ca3e124cb72706a33ed96d86b58bf3339f20aafb4e9R732

Looking into how my code hit this, it seems that CorruptDeltaByteArrays.requiresSequentialReads was previously acting as the dataColumn instanceof RequiresPreviousReader check (CorruptDeltaByteArrays.requiresSequentialReads can only return true when encoding == Encoding.DELTA_BYTE_ARRAY, and org.apache.parquet.column.values.RequiresPreviousReader is only implemented by the *DeltaByteArrayReader classes).

With no check on previousReader instanceof RequiresPreviousReader, the ClassCastException above is possible.
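
To make the failure mode concrete, here is a minimal sketch using stand-in types. The class and interface names mirror those in the stack trace, but none of this is the parquet-java source. `initUnguarded`, `initGuarded`, and `throwsClassCast` are hypothetical helpers, and the guard shown is only one possible check, not necessarily what was merged in #3019.

```java
// Stand-in types named after the classes in the stack trace; NOT the parquet-java classes.
class PreviousReaderDemo {
    interface ValuesReader {}

    interface RequiresPreviousReader {
        void setPreviousReader(ValuesReader reader);
    }

    // Represents a reader for a dictionary-encoded page.
    static class DictionaryValuesReader implements ValuesReader {}

    // Represents a reader for a DELTA_BYTE_ARRAY page.
    static class DeltaByteArrayReader implements ValuesReader, RequiresPreviousReader {
        DeltaByteArrayReader previous;

        @Override
        public void setPreviousReader(ValuesReader reader) {
            // Unchecked downcast: throws if the previous page used a different encoding.
            this.previous = (DeltaByteArrayReader) reader;
        }
    }

    // Hypothetical: only the current reader is checked, so any previous reader is passed on.
    static void initUnguarded(ValuesReader dataColumn, ValuesReader previousReader) {
        if (dataColumn instanceof RequiresPreviousReader) {
            ((RequiresPreviousReader) dataColumn).setPreviousReader(previousReader);
        }
    }

    // Hypothetical guard: the previous reader is checked as well before it is handed over.
    static void initGuarded(ValuesReader dataColumn, ValuesReader previousReader) {
        if (dataColumn instanceof RequiresPreviousReader
                && previousReader instanceof RequiresPreviousReader) {
            ((RequiresPreviousReader) dataColumn).setPreviousReader(previousReader);
        }
    }

    // Hypothetical helper: reports whether the action throws ClassCastException.
    static boolean throwsClassCast(Runnable action) {
        try {
            action.run();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

With a DeltaByteArrayReader as the current reader and a DictionaryValuesReader as the previous one, `initUnguarded` throws while `initGuarded` skips the call and leaves `previous` unset.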

This is more likely to happen when using org.apache.parquet.io.ColumnIOFactory#ColumnIOFactory() to read files without createdBy. In my case, I was able to fix this by supplying createdBy, knowing that all of my Parquet files were written after PARQUET-246, which prevents CorruptDeltaByteArrays.requiresSequentialReads from returning true:

val reader: ParquetFileReader = ...
val fileMetadata = reader.getFooter.getFileMetaData
val createdBy = fileMetadata.getCreatedBy
val columnIO: MessageColumnIO = new ColumnIOFactory(createdBy)...
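
Why supplying createdBy sidesteps the problem can be modelled roughly as follows. This is a simplified stand-in for the decision described above, not the parquet-java implementation. The createdBy format, the 1.8.0 threshold for the PARQUET-246 fix, and the conservative handling of unparsable strings are all assumptions made for illustration, and the sketch ignores files written by other applications.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model of the createdBy-based decision; NOT the parquet-java source.
class SequentialReadModel {
    // Assumption: createdBy looks like "parquet-mr version 1.14.1 (build ...)".
    private static final Pattern VERSION =
            Pattern.compile("version (\\d+)\\.(\\d+)\\.(\\d+)");

    // Assumption: the PARQUET-246 fix first shipped in 1.8.0.
    private static final int[] FIX_VERSION = {1, 8, 0};

    static boolean requiresSequentialReads(String createdBy, String encoding) {
        // Only DELTA_BYTE_ARRAY pages can ever need the previous reader.
        if (!"DELTA_BYTE_ARRAY".equals(encoding)) {
            return false;
        }
        // No writer information: assume the file may be affected.
        if (createdBy == null || createdBy.isEmpty()) {
            return true;
        }
        Matcher m = VERSION.matcher(createdBy);
        // Assumption: an unparsable string is treated conservatively.
        if (!m.find()) {
            return true;
        }
        // Compare major, minor, patch against the assumed fix version.
        for (int i = 0; i < FIX_VERSION.length; i++) {
            int part = Integer.parseInt(m.group(i + 1));
            if (part != FIX_VERSION[i]) {
                return part < FIX_VERSION[i];
            }
        }
        return false; // written by exactly the fix version
    }
}
```

Under these assumptions, a null createdBy with DELTA_BYTE_ARRAY yields true, while a createdBy naming a version at or after the fix yields false, which is the effect the workaround relies on.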

Component(s)

No response

Member

wgtmac commented Sep 19, 2024

Thanks for reporting the bug! Is it possible to provide a file that can reproduce this issue?

cc @gszadovszky this issue seems to be caused by a recent refactoring commit.

@gszadovszky
Contributor

Thanks, @bwjoh. It seems I overlooked how this part worked. The code is not super clear, unfortunately. Also, it seems we are lacking a unit test for this scenario.
Would you like to contribute a fix for this one?

gszadovszky added a commit to gszadovszky/parquet-mr that referenced this issue Sep 27, 2024
wgtmac modified the milestone: 1.14.3 Sep 30, 2024
dongjoon-hyun pushed a commit to apache/spark that referenced this issue Oct 8, 2024
### What changes were proposed in this pull request?
The pr aims to upgrade `Parquet` from `1.14.2` to `1.14.3`.

### Why are the changes needed?
The full release notes: https://github.com/apache/parquet-java/releases/tag/apache-parquet-1.14.3
apache/parquet-java#3007: Ensure version specific Jackson classes are shaded
apache/parquet-java#3013: Fix potential ClassCastException at reading DELTA_BYTE_ARRAY encoding

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48378 from panbingkun/SPARK-49903.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
himadripal pushed a commit to himadripal/spark that referenced this issue Oct 19, 2024