Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support reading BYTE_STREAM_SPLIT encoding in parquet #12809

Merged

Conversation

manupatteri
Copy link
Member

@manupatteri manupatteri commented Jun 12, 2022

Description

Support reading BYTE_STREAM_SPLIT encoding in parquet

Related issues, pull requests, and links

Fixes #8357

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Hive, Hudi, Delta Lake, Iceberg
* Support reading BYTE_STREAM_SPLIT encoding in parquet. ({issue}`8357`)

@manupatteri manupatteri force-pushed the enable_byte_stream_split_parquet branch 3 times, most recently from 75646e6 to 9c4b277 Compare November 27, 2022 02:37
@manupatteri manupatteri force-pushed the enable_byte_stream_split_parquet branch 4 times, most recently from 1af0a48 to bea52a6 Compare November 28, 2022 10:24
@manupatteri manupatteri force-pushed the enable_byte_stream_split_parquet branch 2 times, most recently from d612d9e to cc9e664 Compare January 12, 2023 17:42
@github-actions github-actions bot added the hive Hive connector label May 29, 2023
@manupatteri
Copy link
Member Author

@raunaqmorarka could you please review it again? I thought I can resolve conflicts after basic review is done. Please let me know if I should fix it first.

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve the merge conflicts and squash commits.

@manupatteri manupatteri force-pushed the enable_byte_stream_split_parquet branch from 499cf9e to 44bd0b0 Compare October 30, 2023 10:55
@manupatteri manupatteri force-pushed the enable_byte_stream_split_parquet branch from 44bd0b0 to 2bce6f1 Compare December 10, 2023 08:00
@manupatteri manupatteri force-pushed the enable_byte_stream_split_parquet branch 2 times, most recently from 9ddbcc8 to 3f5fd2f Compare January 7, 2024 08:07
@manupatteri
Copy link
Member Author

@raunaqmorarka I am working on a testcase for float. Appreciate if you can review on rest of changes.

@mosabua
Copy link
Member

mosabua commented Jan 15, 2024

@raunaqmorarka can you please follow up for @manupatteri

@manupatteri manupatteri force-pushed the enable_byte_stream_split_parquet branch 5 times, most recently from 53344fa to e43da5b Compare February 29, 2024 18:29
@manupatteri
Copy link
Member Author

@raunaqmorarka @mosabua I have added tests for float as well. Kindly review it again and let me know your comments.

@mosabua mosabua requested a review from martint March 19, 2024 15:30
@raunaqmorarka raunaqmorarka force-pushed the enable_byte_stream_split_parquet branch 3 times, most recently from a4eb241 to 41bb4d5 Compare March 19, 2024 19:19
@manupatteri manupatteri force-pushed the enable_byte_stream_split_parquet branch 2 times, most recently from 59cb9bf to 824e8b7 Compare March 24, 2024 09:12
@manupatteri
Copy link
Member Author

manupatteri commented Mar 24, 2024

@raunaqmorarka
When I used the common static method, "initialize" in init method in DoubleApacheParquetValueDecoder and FloatApacheParquetValueDecoder, I was getting below exception with byte_stream_split testcase.

org.apache.parquet.io.ParquetDecodingException: Invalid ByteStreamSplit stream, num values upper bound (w/ nulls): 0, num encoded values: 100

at org.apache.parquet.column.values.bytestreamsplit.ByteStreamSplitValuesReader.initFromPage(ByteStreamSplitValuesReader.java:77)
at io.trino.parquet.reader.decoders.ApacheParquetValueDecoders.initialize(ApacheParquetValueDecoders.java:142)
at io.trino.parquet.reader.decoders.ApacheParquetValueDecoders$FloatApacheParquetValueDecoder.init(ApacheParquetValueDecoders.java:120)
at io.trino.parquet.reader.AbstractColumnReader.createValueDecoder(AbstractColumnReader.java:109)
at io.trino.parquet.reader.flat.FlatColumnReader.readFlatPageV1(FlatColumnReader.java:298)
at io.trino.parquet.reader.flat.FlatColumnReader.readPage(FlatColumnReader.java:261)
at io.trino.parquet.reader.flat.FlatColumnReader.seekToNextPage(FlatColumnReader.java:245)
at io.trino.parquet.reader.flat.FlatColumnReader.seek(FlatColumnReader.java:116)
at io.trino.parquet.reader.flat.FlatColumnReader.readPrimitive(FlatColumnReader.java:87)
at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:463)
at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:553)
at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:536)
at io.trino.parquet.reader.ParquetReader.lambda$nextPage$3(ParquetReader.java:251)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:72)
at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:312)
at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:291)
at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:186)
at io.trino.parquet.reader.TestByteStreamSplitEncoding.readAndCompare(TestByteStreamSplitEncoding.java:88)
at io.trino.parquet.reader.TestByteStreamSplitEncoding.testReadFloatDouble(TestByteStreamSplitEncoding.java:57)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
at org.testng.TestRunner.privateRun(TestRunner.java:756)
at org.testng.TestRunner.run(TestRunner.java:610)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:387)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:382)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:340)
at org.testng.SuiteRunner.run(SuiteRunner.java:289)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1293)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1218)
at org.testng.TestNG.runSuites(TestNG.java:1133)
at org.testng.TestNG.run(TestNG.java:1104)
at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:65)
at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:105)

To fix that I had to introduce changes in "initialize" method(to use correct valueCount in ValuesReader::initFromPage call)

@raunaqmorarka raunaqmorarka force-pushed the enable_byte_stream_split_parquet branch from e8f5639 to db166a5 Compare March 25, 2024 04:52
Introduced DoubleApacheParquetValueDecoder and FloatApacheParquetValueDecoder
to handle double and float types respectively
@raunaqmorarka raunaqmorarka force-pushed the enable_byte_stream_split_parquet branch from db166a5 to 1681912 Compare March 25, 2024 04:53
@raunaqmorarka raunaqmorarka changed the title Enable byte stream split parquet Support reading BYTE_STREAM_SPLIT encoding in parquet Mar 25, 2024
@raunaqmorarka raunaqmorarka merged commit 5a5a096 into trinodb:master Mar 25, 2024
57 checks passed
@github-actions github-actions bot added this to the 444 milestone Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

Implement BYTE_STREAM_SPLIT encoding for Parquet
3 participants