[BUG] GPU writing ORC columns statistics #4860

Closed
amahussein opened this issue Feb 24, 2022 · 8 comments · Fixed by #5715
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@amahussein
Collaborator

While testing #4638, I found that a test that writes an ORC file on the GPU and then runs aggregates on it fails with the exception below.
We need to investigate whether this is a Spark bug or a cuDF bug.
Note that writing a Parquet file works fine.

E                   py4j.protocol.Py4JJavaError: An error occurred while calling o356.collectToPython.
E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 5.0 failed 1 times, most recent failure:
Lost task 7.0 in stage 5.0 (TID 163) (10.136.8.146 executor driver): org.apache.spark.SparkException:
Cannot read columns statistics in file: file:/tmp/ahussein/pyspark_tests/c240m5-01-gw1-72665-1063323372/pushdown.orc/p=2/part-00064-47d45f8a-d4ec-42ad-8bfc-0176ade0caba.c000.snappy.orc. Please consider disabling ORC aggregate push down by setting 'spark.sql.orc.aggregatePushdown' to false.
E                       at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.createAggInternalRowFromFooter(OrcUtils.scala:432)
E                       at org.apache.spark.sql.execution.datasources.v2.orc.OrcPartitionReaderFactory$$anon$3.$anonfun$batch$2(OrcPartitionReaderFactory.scala:221)
E                       at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2735)
E                       at org.apache.spark.sql.execution.datasources.v2.orc.OrcPartitionReaderFactory$$anon$3.batch$lzycompute(OrcPartitionReaderFactory.scala:218)
E                       at org.apache.spark.sql.execution.datasources.v2.orc.OrcPartitionReaderFactory$$anon$3.batch(OrcPartitionReaderFactory.scala:217)
E                       at org.apache.spark.sql.execution.datasources.v2.orc.OrcPartitionReaderFactory$$anon$3.get(OrcPartitionReaderFactory.scala:230)
E                       at org.apache.spark.sql.execution.datasources.v2.orc.OrcPartitionReaderFactory$$anon$3.get(OrcPartitionReaderFactory.scala:215)
E                       at org.apache.spark.sql.execution.datasources.v2.PartitionedFileReader.get(FilePartitionReaderFactory.scala:57)
E                       at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.get(FilePartitionReader.scala:89)
E                       at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.next(DataSourceRDD.scala:108)
E                       at org.apache.spark.sql.execution.datasources.v2.MetricsBatchIterator.next(DataSourceRDD.scala:154)
E                       at org.apache.spark.sql.execution.datasources.v2.MetricsBatchIterator.next(DataSourceRDD.scala:151)
E                       at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
E                       at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
E                       at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:198)
E                       at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
E                       at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
E                       at com.nvidia.spark.RebaseHelper$.withResource(RebaseHelper.scala:25)
E                       at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:197)
E                       at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:261)
E                       at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:237)
E                       at scala.Option.getOrElse(Option.scala:189)
E                       at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:235)
E                       at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:181)
E                       at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:288)
E                       at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:304)
E                       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
E                       at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
E                       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
E                       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
E                       at org.apache.spark.scheduler.Task.run(Task.scala:136)
E                       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
E                       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
E                       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
E                       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
E                       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
E                       at java.lang.Thread.run(Thread.java:748)
E                   Caused by: java.util.NoSuchElementException
E                       at java.util.LinkedList.removeFirst(LinkedList.java:270)
E                       at java.util.LinkedList.remove(LinkedList.java:685)
E                       at org.apache.spark.sql.execution.datasources.orc.OrcFooterReader.convertStatistics(OrcFooterReader.java:54)
E                       at org.apache.spark.sql.execution.datasources.orc.OrcFooterReader.readStatistics(OrcFooterReader.java:45)
E                       at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.createAggInternalRowFromFooter(OrcUtils.scala:428)
E                       ... 36 more

Steps/Code to reproduce bug
With aggregate pushdown enabled, the following pytest code fails.

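    def do_orc_scan(spark, path, agg):
        # Write a small partitioned ORC dataset to `path` (parentheses let the chained call span lines).
        (spark.range(10).selectExpr("id", "id % 3 as p").write
            .partitionBy("p")
            .mode("overwrite")
            .orc(path))
        # Read it back and apply the requested aggregate to the partition column.
        df = spark.read.orc(path).selectExpr('{}(p)'.format(agg))
        return df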

Current Workaround

  • Write the file in a CPU session.

Additional context
Pull request that raised this issue: #4859

@amahussein amahussein added bug Something isn't working ? - Needs Triage Need team to review and classify labels Feb 24, 2022
@jlowe jlowe added the P0 Must have for release label Feb 24, 2022
@amahussein amahussein self-assigned this Feb 28, 2022
@sameerz sameerz added this to the Feb 28 - Mar 18 milestone Mar 1, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Mar 1, 2022
@tgravescs
Collaborator

I think the problem here is that Spark always writes the statistics, but at any later point someone might use the "spark.sql.orc.aggregatePushdown" feature during a read. That could be in a totally separate job, where the flag was not enabled during the write. I also assume the same problem exists with Parquet.

For this particular test, we could check whether the flag is set and then fall back to the CPU for the write, but that won't help when the write is a separate job where the flag isn't set. We probably need to make sure it's documented.

@amahussein
Collaborator Author

amahussein commented Mar 8, 2022

Thanks @tgravescs!

To reproduce the exception:

Configuration:

conf={'spark.rapids.sql.format.orc.write.enabled': 'true',
                    'spark.sql.orc.aggregatePushdown': 'true',
                    "spark.sql.sources.useV1SourceList": "",
                    "spark.sql.orc.impl": "native"}

Note that spark.sql.orc.impl=hive does not fail.

The exception is raised when the file is written by the GPU and the aggregate in the read query is pushed down to the ORC file.
For example, the following test fails because of .selectExpr('count(p)'):

@ignore_order
@pytest.mark.parametrize('orc_impl', ["native"])
def test_orc_write_with_aggregate_pushdown_get_file(spark_tmp_path, orc_impl):
    data_path = spark_tmp_path + '/ORC_DATA/pushdown_05.orc'
    assert_gpu_and_cpu_writes_are_equal_collect(
            lambda spark, path: spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").mode('overwrite').orc(path),
            lambda spark, path: spark.read.orc(path).selectExpr('count(p)'),
            data_path,
            conf={'spark.rapids.sql.format.orc.write.enabled': 'true',
                    'spark.sql.orc.aggregatePushdown': 'true',
                    "spark.sql.sources.useV1SourceList": "",
                    "spark.sql.orc.impl": orc_impl})

Is the file generated by the GPU incorrect?

Yes.
I captured the files generated by both GPU and CPU.
Spark throws the same exception reading the generated GPU.orc
orc_file_pusheddown_enabled.tar.gz

Having separate read/write jobs

I did a test in two steps:

1- A write job with aggregatePushdown enabled.
2- A separate job that reads the file with aggregatePushdown disabled.

For Spark, this does not raise any exception.

Current state

The GPU does not generate the ORC file schema expected for pushed-down aggregates.

The problem here is that every file created by the GPU will throw an exception when it is
scanned with aggregatePushdown=true and the aggregates are actually pushed down.

Falling back to CPU on ORC write?

There is always a chance that the read job has aggregatePushdown=true;
thus we would have to fall back to the CPU regardless of the value of aggregatePushdown at write time. Clearly, this is not ideal.

I will continue digesting the changes introduced to Spark and follow the ORC writing part on the GPU.
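
The reasoning about the CPU fallback can be made concrete with a toy model. It is only a sketch under stated assumptions: a read is assumed to fail exactly when the file was written by the GPU and the read job has aggregate pushdown enabled, and the two policy functions (`gpu_write_unless_flag_set`, `never_gpu_write`) are hypothetical names, not plugin code.

```python
from itertools import product


def read_fails(written_on_gpu: bool, read_pushdown: bool) -> bool:
    """Assumed failure condition, taken from the observations in this thread."""
    return written_on_gpu and read_pushdown


def gpu_write_unless_flag_set(write_pushdown: bool) -> bool:
    """Policy A: return True (write on GPU) unless the flag is set in the write job."""
    return not write_pushdown


def never_gpu_write(write_pushdown: bool) -> bool:
    """Policy B: always return False, i.e. every ORC write falls back to the CPU."""
    return False


def failing_cases(policy):
    """List (write_pushdown, read_pushdown) pairs whose read would fail."""
    return [
        (w, r)
        for w, r in product([False, True], repeat=2)
        if read_fails(policy(w), r)
    ]


policy_a_failures = failing_cases(gpu_write_unless_flag_set)
policy_b_failures = failing_cases(never_gpu_write)
```

Under these assumptions, policy A still leaves the case where the flag is off in the write job but on in the read job, which is why only the unconditional fallback removes every failing combination.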

@amahussein
Collaborator Author

Cudf-FEA-10075 ("Add File Statistic when writing the ORC file") is still not implemented.
It is a blocker for fixing the current bug.
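
To see why missing file-level statistics would surface as the `NoSuchElementException` in the stack trace above, here is a simplified Python model of a footer reader that consumes one statistics entry per schema node. It is only a sketch: the `SchemaNode` class and the `convert_statistics` function are hypothetical stand-ins, not Spark's actual `OrcFooterReader` code, and Python's `IndexError` plays the role of Java's `NoSuchElementException`.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class SchemaNode:
    """Hypothetical stand-in for an ORC type description."""
    name: str
    children: List["SchemaNode"] = field(default_factory=list)


def convert_statistics(schema: SchemaNode, stats: Deque[dict]) -> dict:
    """Pair each schema node with the next statistics entry, depth first."""
    # popleft() raises IndexError on an empty deque, much like
    # LinkedList.remove() raises NoSuchElementException in Java.
    node_stats = stats.popleft()
    return {
        "name": schema.name,
        "stats": node_stats,
        "children": [convert_statistics(c, stats) for c in schema.children],
    }


# struct<id:bigint>: column 0 is the root struct, column 1 is `id`.
schema = SchemaNode("root", [SchemaNode("id")])

# A file with one file-level statistics entry per column converts cleanly.
full_stats = deque([{"count": 1}, {"count": 1, "min": 1, "max": 1, "sum": 1}])
full_result = convert_statistics(schema, full_stats)

# A file whose file-level statistics list is empty fails on the first node.
try:
    convert_statistics(schema, deque())
    empty_failed = False
except IndexError:
    empty_failed = True
```

In this model the failure happens at the root node, before any column data is looked at, which matches an exception thrown from the footer-reading path rather than from the scan itself.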

@amahussein
Collaborator Author

Metadata of the GPU-generated file.

The file statistics are empty:

ahussein@c240m5-01:~/workspace/repos/arapids-4860/debug_orc$ java -jar orc-tools-1.7.3-uber.jar meta gpu_pushdown_05.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00057-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00057-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 7 max: 7 sum: 7

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00035-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00035-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 4 max: 4 sum: 4

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00014-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=1/part-00014-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 1 max: 1 sum: 1

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00007-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00007-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 0 max: 0 sum: 0

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00028-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00028-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 3 max: 3 sum: 3

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00050-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00050-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 6 max: 6 sum: 6

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00071-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=0/part-00071-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 9 max: 9 sum: 9

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00064-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00064-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 8 max: 8 sum: 8

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00043-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00043-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 5 max: 5 sum: 5

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00021-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc [length: 157]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/gpu_pushdown_05.orc/p=2/part-00021-88527f01-9dc1-434a-8d16-fa424ba7af13.c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 2 max: 2 sum: 2

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 22
    Stream: column 1 section DATA start: 32 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 157 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
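
Checking many part files by eye is tedious, so the empty section can also be detected mechanically. The following is a minimal sketch that scans the text printed by `orc-tools meta` for a single file and reports whether the `File Statistics:` section lists any columns. The function name `has_file_statistics` is hypothetical, and the parsing assumes the output layout shown above (a `File Statistics:` header followed by indented `Column N:` lines, then a blank line).

```python
def has_file_statistics(meta_text: str) -> bool:
    """Return True if the 'File Statistics:' section has at least one column."""
    in_section = False
    for line in meta_text.splitlines():
        stripped = line.strip()
        if stripped == "File Statistics:":
            in_section = True
            continue
        if in_section:
            # The first line after the header decides: a column entry means
            # statistics are present; a blank line or new header means none.
            return stripped.startswith("Column ")
    return False


# Excerpt in the shape of the GPU-written output above.
gpu_meta = """\
Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: true min: 7 max: 7 sum: 7

File Statistics:

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 29
"""

# Excerpt in the shape of a CPU-written file with populated statistics.
cpu_meta = """\
File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1

Stripes:
"""

gpu_ok = has_file_statistics(gpu_meta)
cpu_ok = has_file_statistics(cpu_meta)
```

The function inspects only the first `File Statistics:` section it finds, so output covering several part files would need to be split on the separator line first.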

@amahussein
Collaborator Author

amahussein commented Mar 9, 2022

For reference, the metadata of the CPU-generated ORC file (Spark native):

java -jar orc-tools-1.7.3-uber.jar meta cpu_pushdown_05.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00014-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00014-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00057-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00057-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00035-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=1/part-00035-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 4 max: 4 sum: 4

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 4 max: 4 sum: 4

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00007-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00007-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00028-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00028-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 3 max: 3 sum: 3

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 3 max: 3 sum: 3

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00071-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00071-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00050-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=0/part-00050-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 6 max: 6 sum: 6

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 6 max: 6 sum: 6

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00064-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00064-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00043-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00043-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 5 max: 5 sum: 5

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 5 max: 5 sum: 5

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00021-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_pushdown_05.orc/p=2/part-00021-41a9b295-c929-44e4-b2b5-c01b01b95ed0.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 2 max: 2 sum: 2

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 2 max: 2 sum: 2

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

@amahussein
Collaborator Author

CPU hive metadata:

java -jar orc-tools-1.7.3-uber.jar meta cpu_hive_pushdown_10.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00014-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00014-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00035-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00035-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 4 max: 4 sum: 4

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 4 max: 4 sum: 4

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00057-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=1/part-00057-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 7 max: 7 sum: 7

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00050-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00050-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 6 max: 6 sum: 6

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 6 max: 6 sum: 6

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00007-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00007-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 0 max: 0 sum: 0

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00028-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00028-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 3 max: 3 sum: 3

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 3 max: 3 sum: 3

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00071-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=0/part-00071-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00043-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00043-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 5 max: 5 sum: 5

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 5 max: 5 sum: 5

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00021-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00021-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 2 max: 2 sum: 2

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 2 max: 2 sum: 2

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

Processing data file file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00064-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc [length: 237]
Structure for file:/home/ahussein/workspace/repos/arapids-4860/debug_orc/cpu_hive_pushdown_10.orc/p=2/part-00064-b5d0b4e2-76c1-4aad-82f6-dc847c0dc3c5.c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.3
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 237 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
________________________________________________________________________________________________________________________

@amahussein
Collaborator Author

GpuOrcWriter calls Table.writeORCChunked(builder.build(), this), which is the cuDF ORC writer in chunked mode.
In chunked mode, cuDF does not write file-level statistics, so Spark cannot read the column statistics from the ORC footer when aggregate pushdown is enabled.
See the related issue Cudf-bug-5826: ORC file-level statistics omitted with chunked writes.
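
To compare the dumps above without reading them by eye, a small script can pull the "File Statistics:" section out of the orc-tools meta output and check whether a column carries min/max. This is only a sketch: the helper names file_statistics and has_min_max are hypothetical, and it assumes the text layout shown in the CPU dumps in this thread (a "File Statistics:" header followed by indented "Column N: ..." lines). What a GPU-written file prints in that section is not assumed here; the check simply reports False when min/max are absent.

```python
import re


def file_statistics(meta_text):
    """Return {column_index: stats_text} from the 'File Statistics:' section
    of a single file's `orc-tools meta` dump (hypothetical helper)."""
    stats = {}
    in_section = False
    for line in meta_text.splitlines():
        stripped = line.strip()
        if stripped == "File Statistics:":
            in_section = True
            continue
        if in_section:
            match = re.match(r"Column (\d+): (.*)", stripped)
            if not match:
                # A blank line or the next header ends the section.
                break
            stats[int(match.group(1))] = match.group(2)
    return stats


def has_min_max(meta_text, column=1):
    """True if the file-level statistics for `column` include min and max."""
    entry = file_statistics(meta_text).get(column, "")
    return "min:" in entry and "max:" in entry


# Excerpt copied from one of the CPU dumps above.
sample = """\
File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 8 max: 8 sum: 8

Stripes:
"""

print(has_min_max(sample))
```

Running the same check against the dump of a GPU-written file would show whether the file-level min/max that the aggregate pushdown path needs are present.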

@amahussein
Collaborator Author

amahussein commented Mar 15, 2022
