[BUG] GPU writing ORC columns statistics #4860
Comments
So I think the problem here is that Spark always writes the statistics, but at any later point someone might use the feature "spark.sql.orc.aggregatePushdown" during a read. That read could be in a totally separate job, and the flag might not have been enabled during the write. I also assume the same problem exists with Parquet. For this particular test we could check whether the flag is set and then fall back to the CPU for the write, but that won't help the case where the write is a separate job in which the flag isn't set. We probably need to make sure it's documented. |
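To make the limitation concrete, here is a minimal sketch of the fallback check described above. It is not plugin code: `should_fall_back_to_cpu` and the plain-dict configuration are hypothetical, and only the `spark.sql.orc.aggregatePushdown` key comes from this thread. The check can only see the configuration of the job doing the write, so a write job without the flag would still write on the GPU even if a later read job enables pushdown.

```python
AGG_PUSHDOWN_KEY = "spark.sql.orc.aggregatePushdown"


def should_fall_back_to_cpu(write_conf):
    """Hypothetical check: fall back to the CPU ORC writer only when the
    writing job itself has aggregate pushdown enabled."""
    # Treat a missing key as disabled.
    return str(write_conf.get(AGG_PUSHDOWN_KEY, "false")).lower() == "true"


# Write and read in the same job: the flag is visible at write time.
same_job_conf = {AGG_PUSHDOWN_KEY: "true"}

# Write in a separate job: the flag is only set later, by the read job,
# so the write-time check cannot see it.
separate_write_job_conf = {}

print(should_fall_back_to_cpu(same_job_conf))
print(should_fall_back_to_cpu(separate_write_job_conf))
```

The second case is the one that a write-time check cannot cover, which is why documenting the limitation was suggested.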
Thanks @tgravescs! To reproduce the exception:

Configuration:

    conf={'spark.rapids.sql.format.orc.write.enabled': 'true',
          'spark.sql.orc.aggregatePushdown': 'true',
          "spark.sql.sources.useV1SourceList": "",
          "spark.sql.orc.impl": "native"}

Note that the exception is raised when the file is written by the GPU and the query on read is pushed down to the ORC file.

    @ignore_order
    @pytest.mark.parametrize('orc_impl', ["native"])
    def test_orc_write_with_aggregate_pushdown_get_file(spark_tmp_path, orc_impl):
        data_path = spark_tmp_path + '/ORC_DATA/pushdown_05.orc'
        assert_gpu_and_cpu_writes_are_equal_collect(
            # Write 10 rows partitioned by p = id % 3.
            lambda spark, path: spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").mode('overwrite').orc(path),
            # Read back with an aggregate that can be pushed down to the ORC file.
            lambda spark, path: spark.read.orc(path).selectExpr('count(p)'),
            data_path,
            conf={'spark.rapids.sql.format.orc.write.enabled': 'true',
                  'spark.sql.orc.aggregatePushdown': 'true',
                  "spark.sql.sources.useV1SourceList": "",
                  "spark.sql.orc.impl": orc_impl})

Is the file generated by the GPU incorrect? Yes.

Having separate read/write jobs, I did a test in two steps: 1- set write job with aggregatePushdown

For Spark, this does not raise any exception.

Current state: the GPU does not generate the correct ORC file schema expected for the pushed-down aggregates. The problem here is that all files created by the GPU will throw an exception when the file

Falling back to CPU on ORC write? There is a probability that the read job has

I will continue digesting the changes introduced to Spark and follow the ORC writing part on the GPU. |
Cudf-FEA-10075 ("Add File Statistic when writing the ORC file") is still not implemented. |
Metadata of the GPU-generated file. The File Statistics section is empty:
ahussein@c240m5-01:~/workspace/repos/arapids-4860/debug_orc$ java -jar orc-tools-1.7.3-uber.jar meta gpu_pushdown_05.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00057-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00057-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 7 max: 7 sum: 7
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00035-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00035-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 4 max: 4 sum: 4
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00014-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00014-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 1 max: 1 sum: 1
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00007-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00007-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 0 max: 0 sum: 0
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00028-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00028-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 3 max: 3 sum: 3
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00050-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00050-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 6 max: 6 sum: 6
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00071-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00071-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 9 max: 9 sum: 9
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00064-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00064-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 8 max: 8 sum: 8
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00043-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00043-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 5 max: 5 sum: 5
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00021-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00021-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: true min: 2 max: 2 sum: 2
File Statistics:
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 22
Stream: column 1 section DATA start: 32 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
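The empty "File Statistics:" sections in the dump above can also be spotted programmatically. The following is a rough sketch that scans text in the `orc-tools meta` layout shown in this thread and lists the files whose File Statistics section has no column entries. The function names and the shortened sample inputs (`gpu.orc`, `cpu.orc`) are made up for illustration, and the parsing assumes file paths without spaces.

```python
def file_statistics_by_file(meta_text):
    """Map each 'Processing data file' path to the 'Column ...' lines
    found under its 'File Statistics:' header."""
    result = {}
    current = "<unknown>"
    in_file_stats = False
    for raw in meta_text.splitlines():
        line = raw.strip()
        if line.startswith("Processing data file"):
            # Fourth whitespace-separated token is the path.
            current = line.split()[3]
            result[current] = []
            in_file_stats = False
        elif line == "File Statistics:":
            result.setdefault(current, [])
            in_file_stats = True
        elif in_file_stats and line.startswith("Column "):
            result[current].append(line)
        elif in_file_stats:
            # Any other line (for example 'Stripes:') ends the section.
            in_file_stats = False
    return result


def files_missing_file_statistics(meta_text):
    """Return the paths whose File Statistics section is empty."""
    stats = file_statistics_by_file(meta_text)
    return [path for path, columns in stats.items() if not columns]


# Shortened samples modelled on the dumps in this thread.
GPU_SAMPLE = """\
Processing data file gpu.orc [length: 157]
Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 7 max: 7 sum: 7
File Statistics:
Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
"""

CPU_SAMPLE = """\
Processing data file cpu.orc [length: 237]
Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
"""

print(files_missing_file_statistics(GPU_SAMPLE))
print(files_missing_file_statistics(CPU_SAMPLE))
```

In practice the input would be the captured output of the `java -jar orc-tools-1.7.3-uber.jar meta ...` command used above.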
|
For reference, the metadata of the CPU-generated ORC file (Spark native):
java -jar orc-tools-1.7.3-uber.jar meta cpu_pushdown_05.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00014-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00014-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00057-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00057-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00035-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00035-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 4 max: 4 sum: 4
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 4 max: 4 sum: 4
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00007-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00007-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00028-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00028-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 3 max: 3 sum: 3
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 3 max: 3 sum: 3
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00071-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00071-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00050-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00050-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 6 max: 6 sum: 6
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 6 max: 6 sum: 6
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00064-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00064-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00043-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00043-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 5 max: 5 sum: 5
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 5 max: 5 sum: 5
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00021-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00021-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 2 max: 2 sum: 2
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 2 max: 2 sum: 2
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________ |
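Besides the missing file-level statistics, the stripe statistics in the two dumps differ slightly as well. As a small illustration, the sketch below parses one "Column 1" stripe-statistics line from the GPU dump and the matching line from the CPU dump (both for part-00057 under p=1) and reports the fields that differ. The helper names are hypothetical, and the parser assumes the simple "key: value" layout seen in these dumps.

```python
def parse_column_stats(line):
    """Parse 'Column N: key: value key: value ...' into a dict of strings."""
    # Drop the leading 'Column N' label, keep the key/value pairs.
    _, _, rest = line.strip().partition(": ")
    tokens = rest.split()
    return {tokens[i].rstrip(":"): tokens[i + 1]
            for i in range(0, len(tokens) - 1, 2)}


def diff_stats(gpu_line, cpu_line):
    """Return {field: (gpu_value, cpu_value)} for fields that differ.
    A field missing on one side is reported as None."""
    gpu = parse_column_stats(gpu_line)
    cpu = parse_column_stats(cpu_line)
    return {key: (gpu.get(key), cpu.get(key))
            for key in sorted(set(gpu) | set(cpu))
            if gpu.get(key) != cpu.get(key)}


# Lines copied from the GPU and CPU dumps in this thread.
GPU_LINE = "Column 1: count: 1 hasNull: true min: 7 max: 7 sum: 7"
CPU_LINE = "Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7"

print(diff_stats(GPU_LINE, CPU_LINE))
```

For these two lines the differing fields are `hasNull` (true on the GPU file, false on the CPU file) and `bytesOnDisk` (present only in the CPU file).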
CPU Hive metadata:
java -jar orc-tools-1.7.3-uber.jar meta cpu_hive_pushdown_10.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00014-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00014-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00035-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00035-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 4 max: 4 sum: 4
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 4 max: 4 sum: 4
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00057-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00057-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00050-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00050-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 6 max: 6 sum: 6
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 6 max: 6 sum: 6
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00007-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00007-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00028-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00028-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 3 max: 3 sum: 3
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 3 max: 3 sum: 3
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00071-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00071-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00043-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00043-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 5 max: 5 sum: 5
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 5 max: 5 sum: 5
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00021-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00021-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 2 max: 2 sum: 2
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 2 max: 2 sum: 2
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00064-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00064-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 24
Stream: column 1 section DATA start: 38 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________ |
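For context on why these footer statistics matter: with `spark.sql.orc.aggregatePushdown` enabled, Spark can answer MIN/MAX/COUNT from the ORC file statistics instead of scanning rows, so a file whose statistics are missing or malformed breaks the read. Below is a rough sketch of that idea, working only on the text of a dump like the one above. The helper names `file_level_stats` and `combine` are made up for illustration; this is not Spark's implementation and it does not read the ORC footer itself.

```python
import re

# Matches lines such as:
#   Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
COLUMN_RE = re.compile(r"Column (\d+): count: (\d+) hasNull: (?:true|false)(.*)")
FIELD_RE = re.compile(r"\b(min|max): (-?\d+)")

# Excerpt shaped like two of the part files in the dump above.
DUMP = """\
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7
Stripes:
Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0
Stripes:
"""


def file_level_stats(dump_text):
    """Yield (column_id, count, min, max) from each 'File Statistics' section.

    Stripe-level statistics are skipped so rows are not counted twice.
    """
    in_file_stats = False
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line == "File Statistics:":
            in_file_stats = True
            continue
        match = COLUMN_RE.match(line) if in_file_stats else None
        if match is None:
            # Any non-column line ends the current File Statistics section.
            in_file_stats = False
            continue
        bounds = {name: int(value) for name, value in FIELD_RE.findall(match.group(3))}
        yield int(match.group(1)), int(match.group(2)), bounds.get("min"), bounds.get("max")


def combine(stats, column):
    """Merge per-file statistics for one column into (count, min, max)."""
    total, lowest, highest = 0, None, None
    for col, count, col_min, col_max in stats:
        if col != column:
            continue
        total += count
        if col_min is not None:
            lowest = col_min if lowest is None else min(lowest, col_min)
        if col_max is not None:
            highest = col_max if highest is None else max(highest, col_max)
    return total, lowest, highest


print(combine(file_level_stats(DUMP), column=1))  # (2, 0, 7)
```

Column 0 is the root struct, so it carries a count but no min/max; only the `id` column (column 1) has integer bounds in these files.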
While testing #4638, I found that writing an ORC file with aggregates on GPU causes the test to fail with the exception below.
We need to investigate whether this is a Spark bug or a cuDF bug.
Note that writing Parquet files works fine.
Steps/Code to reproduce bug
With aggregate pushdown enabled, the following pytest code fails.
Current Workaround
Additional context
Pull request that raised this issue: #4859