Summary of Parquet reader Issues #9560
Comments
@8dukongjian Very informative, thanks! I'd suggest submitting an issue for each problem you found, with a detailed description and a reproducible case, and using this issue as an umbrella issue to track all of them.
Thanks, good suggestions. I will work on it later.
@8dukongjian Thank you for the summary. Do you plan to work on fixing these?
Thank you! It's what we need. @yma11 can add more test results from parquet-mr later.
Thanks! It looks like there are more issues related to encoding support which were not covered before.
@8dukongjian Thanks Sitao. I'll triage them tomorrow myself, then we will go over the list in this week's Parquet sync meeting on Friday at 1pm. Please feel free to join.
Hi @qqibrow are you talking about the V2 encodings or existing ones? |
@yingsu00 Are there any plans to support Parquet v2 encoding? |
Yes, I do have a plan to support V2 encodings, but it will be in the second half. Do you need it urgently?
@yingsu00 So I have a silly question: Reetika tried to write Hive tables in Prestissimo, and the Parquet file has format version 2.6 when I inspected it with parquet-tools. So that's not Parquet v2? Where should I look if I want to tell the version of a Parquet file?
Or maybe that file didn’t use any Parquet v2 specific features so Prestissimo can still read it… |
@yzhang1991 I think the 2.x version refers to the DataPage version, and does not necessarily mean the encodings of the data are all V2. The writer could just be encoding the data in V1 encodings. We usually say the following encodings are V2 encodings, see https://parquet.apache.org/docs/file-format/data-pages/encodings/
Currently the Velox Parquet reader can read both V1 and V2 DataPage headers, but of the V2 encodings it only supports No. 9, added in the PR "Support BYTE_STREAM_SPLIT encoding in native Parquet reader". Support for Nos. 5, 6, and 7 still needs to be added.
@yingsu00 Yes, we need to use DELTA_BYTE_ARRAY encoding. |
@8dukongjian I took the liberty to edit the table and added a "seq" column to help us quickly identify the issues. I also created "Parquet reading failed to decompress LZO files", and @nmahadevuni will take a look.
Ok, we'll prioritize this encoding then. |
Thanks, now the table is clearer. |
@8dukongjian does your team have bandwidth to take No. 9 in the bugs list? Currently no one is working on it.
@qqibrow @8dukongjian I will take a look at bug no. 9 |
@yingsu00 thanks. I haven't had time for a detailed check. I am wondering whether …
@qqibrow We need a BooleanRleBpDataDecoder. There are a number of other types/decoders that need to be added. I'll create a separate issue for it.
Took a quick look at bug 8 mentioned above, named "Null pointer". It seems to be due to an unsupported …
@yma11 Thanks! Could you create an issue and share the stack trace and a file to reproduce there? Also, are you going to work on that?
#9757 is created for tracking. So we need to create an issue for each failure? I thought we would like to keep them in this consolidated place. I don't have bandwidth for a fix for now, but will first focus on finding more issues by leveraging the Parquet fuzzer test. Can you help on it?
Let's create one issue for each failure reason, and document the failure message clearly. That way it is at least easy to tell whether an issue we meet during a Gluten run is already known in the list or not.
Thanks, I updated #9757 in the table |
@8dukongjian #10752 and fix.
Bug description
Using the test method provided by @qqibrow in #7478, four compression formats (GZIP, SNAPPY, LZO and UNCOMPRESSED) and two Parquet versions (V1 and V2) were tested, for a total of eight test scenarios. The problems discovered by the tests are summarized in the table below.
Bugs
1. Empty collection in array or map leads to incorrect results in the Parquet reader
2. Parsing complex type errors, e.g. `ARRAY<STRUCT<test:string>>` is parsed into `ARRAY<string>`
3. children size should not be larger than 2
5. ColumnMetaData does not exist for schema Id
6. decompression failed, decompressedSize is not equal to remainingOutputSize
7. For raw decompression, compressedLength should be greater than zero
8. core dump in StringColumnReader::processFilter (missing decoder for VARBINARY on FLBA)
9. Parquet PageReader incorrectly skips rep/def levels when the max values are 0
10. Velox parquet scan fails when selecting the row index column before a data column

Files to reproduce:
int96.zip
timestamp_mills.zip
timestamp_micros.zip
array_struct.zip
children.zip
presetNulls.zip
ColumnMetaData.zip
lzo.zip
delta_byte_array.zip
rle.zip
array.zip
snappy.zip
System information
None
Relevant logs
No response
A more complete feature request list is in #9767
Feature requests (backup):
plain encoding: oap-project#456