[BUG] ORC writer produce wrong timestamp metrics which causes spark not to do predicate push down #14325

thirtiseven · 2023-10-25T06:22:37Z

Describe the bug
PR #13848 added minimum/maximum and minimumNanos/maximumNanos for ORC Writer timestamp statistics. It was intended to fix #13899 that Spark does not do predicate push down for gpu generated timestamp files. However the predicate push down test is still fails after above PR was merged, see NVIDIA/spark-rapids#9075.

When trying to see the meta of related files with orc-tools, it throws Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0. And the min max are also mismatched with cpu-generated file with same data. I think it cause Spark to fail to do pushdown.

Steps/Code to reproduce bug

spark-shell with spark-rapids:

scala> import java.sql.{Date, Timestamp}
import java.sql.{Date, Timestamp}

scala> val timeString = "2015-08-20 14:57:00"
timeString: String = 2015-08-20 14:57:00

scala> val data = (0 until 10).map { i =>
     |           val milliseconds = Timestamp.valueOf(timeString).getTime + i * 3600
     |           Tuple1(new Timestamp(milliseconds))
     |         }
data: scala.collection.immutable.IndexedSeq[(java.sql.Timestamp,)] = Vector((2015-08-20 14:57:00.0,), (2015-08-20 14:57:03.6,), (2015-08-20 14:57:07.2,), (2015-08-20 14:57:10.8,), (2015-08-20 14:57:14.4,), (2015-08-20 14:57:18.0,), (2015-08-20 14:57:21.6,), (2015-08-20 14:57:25.2,), (2015-08-20 14:57:28.8,), (2015-08-20 14:57:32.4,))

scala> val df = spark.createDataFrame(data).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: timestamp]

scala> df.write.orc("ORC_PPD_GPU")

orc-tools:

java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_GPU/
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 304]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 2
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 2 hasNull: true
    Column 1: count: 2 hasNull: false min: 2015-08-20 14:57:28.799999999 max: 2015-08-20 14:57:32.399999999

File Statistics:
  Column 0: count: 2 hasNull: true
  Column 1: count: 2 hasNull: false min: 2015-08-20 14:57:28.799999999 max: 2015-08-20 14:57:32.399999999

Stripes:
  Stripe: offset: 3 data: 25 rows: 2 tail: 56 index: 64
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 57
    Stream: column 1 section PRESENT start: 67 length 5
    Stream: column 1 section DATA start: 72 length 13
    Stream: column 1 section SECONDARY start: 85 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 304 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 300]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: false min: 2015-08-20 14:57:21.599999999 max: 2015-08-20 14:57:21.599999999

File Statistics:
  Column 0: count: 1 hasNull: true
  Column 1: count: 1 hasNull: false min: 2015-08-20 14:57:21.599999999 max: 2015-08-20 14:57:21.599999999

Stripes:
  Stripe: offset: 3 data: 21 rows: 1 tail: 56 index: 64
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 57
    Stream: column 1 section PRESENT start: 67 length 5
    Stream: column 1 section DATA start: 72 length 10
    Stream: column 1 section SECONDARY start: 82 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 300 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 300]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0
	at java.sql/java.sql.Timestamp.setNanos(Timestamp.java:336)
	at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.getMinimum(ColumnStatisticsImpl.java:1764)
	at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.toString(ColumnStatisticsImpl.java:1808)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:363)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
	at org.apache.orc.tools.FileDump.main(FileDump.java:137)
	at org.apache.orc.tools.Driver.main(Driver.java:124)

Related test cases in spark-rapids:
Support for pushing down filters for timestamp types

Expected behavior
The statistics for orc files should be reasonable and Spark should be able to do predicate push down on gpu-generated orc files.

Environment overview (please complete the following information)

Environment location: Bare-metal
Method of cuDF install: from source

The text was updated successfully, but these errors were encountered:

sameerz · 2023-10-30T21:24:08Z

@thirtiseven can you attach a minimal sample file that demonstrates the problem, and commands from the orc command line tool showing the error?

thirtiseven · 2023-10-31T03:08:50Z

@sameerz ok, a sample file: ORC_PPD_FAILED_GPU.zip

run:

java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_FAILED_GPU/

will get:

[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-0c687a13-7e91-422d-af76-ce177d66dd94-c000.snappy.orc [length: 333]
Structure for file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-0c687a13-7e91-422d-af76-ce177d66dd94-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-0c687a13-7e91-422d-af76-ce177d66dd94-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 10
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: true
Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0
	at java.sql/java.sql.Timestamp.setNanos(Timestamp.java:336)
	at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.getMinimum(ColumnStatisticsImpl.java:1764)
	at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.toString(ColumnStatisticsImpl.java:1808)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:363)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
	at org.apache.orc.tools.FileDump.main(FileDump.java:137)
	at org.apache.orc.tools.Driver.main(Driver.java:124)

vuule · 2023-11-06T22:28:32Z

~~@thirtiseven do you have the CPU version of the invalid file? I'd like to debug the writer as it creates the invalid statistics.~~
Disregard, I can read the GPU file and write it back out.

vuule · 2023-11-07T00:24:56Z

older orc-tools works fine (maybe no nanos support yet?)

$ java -jar orc-tools-1.5.2-uber.jar meta /home/vukasin/cudf/stats_fresh.orc 
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/snap/orc/2/share/orc-tools-1.5.2-uber.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Processing data file /home/vukasin/cudf/stats_fresh.orc [length: 333]
Structure for /home/vukasin/cudf/stats_fresh.orc
File Version: 0.12 with ORIGINAL
Rows: 10
Compression: SNAPPY
Compression size: 262144
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: true
    Column 1: count: 10 hasNull: false min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

File Statistics:
  Column 0: count: 10 hasNull: true
  Column 1: count: 10 hasNull: false min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

Stripes:
  Stripe: offset: 3 data: 54 rows: 10 tail: 56 index: 64
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 57
    Stream: column 1 section PRESENT start: 67 length 6
    Stream: column 1 section DATA start: 73 length 34
    Stream: column 1 section SECONDARY start: 107 length 14
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 333 bytes
Padding length: 0 bytes
Padding ratio: 0%

vuule · 2023-11-07T00:26:11Z

FWIW, I'm also able to read the correct statistics in libcudf.
So, no local repro yet. Will try newer orc-tools next.

thirtiseven · 2023-11-07T00:48:48Z

older orc-tools works fine (maybe no nanos support yet?)

Yes, the nanosecond support is later than ORC 1.5.2

vuule · 2023-11-07T01:22:50Z

Opened #14367 with A fix for nanosecond statistics. @thirtiseven can you please run the test with this branch and see if it affects the repro?

thirtiseven · 2023-11-07T02:35:30Z

Hi @vuule , I can still repro for both orc-tools and spark PPD with this branch.

vuule · 2023-11-07T18:27:27Z

@thirtiseven is there any isolation regarding which timestamp values trigger the issue?

vuule · 2023-11-07T19:29:32Z

Found that the nanoseconds are encoded as value + 1, that why CPU reader complained about the range - zero would become -1.
Pushed a fix for the off by one to #14367, @thirtiseven please verify if it fixes the issue.
Statistics should now be correct for any nanosecond value.

thirtiseven · 2023-11-08T03:18:10Z

@vuule Thanks! The new commit no longer crashes orc-tools, and the nanosecond values look the same!

However, the predicate pushdown still somehow does not work for gpu files, seems to still have some mismatch with cpu. Any ideas?

Some new result files from cpu/gpu:
ORC_PPD_FAILED.zip

GPU meta:

java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_FAILED_GPU/
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-68ec6cc3-036b-4e78-945a-f542e184914d-c000.snappy.orc [length: 327]
Structure for file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-68ec6cc3-036b-4e78-945a-f542e184914d-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-68ec6cc3-036b-4e78-945a-f542e184914d-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 10
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: true
    Column 1: count: 10 hasNull: false min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

File Statistics:
  Column 0: count: 10 hasNull: true
  Column 1: count: 10 hasNull: false min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

Stripes:
  Stripe: offset: 3 data: 54 rows: 10 tail: 56 index: 62
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 55
    Stream: column 1 section PRESENT start: 65 length 6
    Stream: column 1 section DATA start: 71 length 34
    Stream: column 1 section SECONDARY start: 105 length 14
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 327 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

CPU meta:

java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_FAILED_CPU/
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_FAILED_CPU/part-00000-f0d75466-fca0-400e-b6dc-08e6f9c570a9-c000.snappy.orc [length: 345]
Structure for file:/home/haoyangl/ORC_PPD_FAILED_CPU/part-00000-f0d75466-fca0-400e-b6dc-08e6f9c570a9-c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.8.4
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_FAILED_CPU/part-00000-f0d75466-fca0-400e-b6dc-08e6f9c570a9-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 10
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>
Attributes on root.a
  spark.sql.catalyst.type: timestamp

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: false
    Column 1: count: 10 hasNull: false bytesOnDisk: 29 min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

File Statistics:
  Column 0: count: 10 hasNull: false
  Column 1: count: 10 hasNull: false bytesOnDisk: 29 min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

Stripes:
  Stripe: offset: 3 data: 29 rows: 10 tail: 48 index: 48
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 37
    Stream: column 1 section DATA start: 51 length 14
    Stream: column 1 section SECONDARY start: 65 length 15
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 345 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.4.1
________________________________________________________________________________________________________________________

revans2 · 2023-11-08T17:03:49Z

I did some digging and it is because the ORC reader is trying really hard to be cautious.

https://github.com/apache/orc/blob/7c839256470690b6b1a415a784bd924236c426a4/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L678-L683

Our writer version shows up as Original, which does not include a fix for timestamps (ORC-135).

revans2 · 2023-11-08T21:15:00Z

So I have been looking at the writer version, code and numbers that we were assigned by ORC a bit more, and I think we might be able to make this work.

https://github.com/apache/orc/blob/7c839256470690b6b1a415a784bd924236c426a4/java/core/src/java/org/apache/orc/OrcFile.java#L165-L197

and

https://github.com/apache/orc/blob/7c839256470690b6b1a415a784bd924236c426a4/c%2B%2B/include/orc/Common.hh#L94-L106

are the code that holds a lot of this version information.

The writer version is a little confusing because the C++ code and the java code use the similar names in slightly different ways, but I am going to go with the java code here, and then call out the C++ code when it appears to be different. The WriterVersion is made up of two parts. One is the FileTail.postscript.writerVersion in the protocol buffers, it is really a capability number more than anything. The other is FailTail.footer.writer in the protocol buffers. In java this is the WriterImplementation class. The numeric value says what piece of code did the write. Looking at the java code there are not a lot of places where the writerVersion is used. HIVE_12055 and ORC_135 are the only ones that are in the java code base that are not just for tests. In both of those cases the check to see if the bug is present or not assumes that the bug is only in the java WriterImplementation. So if we just say that we are CUDF we should be good to go from a java perspective.

The C++ code is different. It has an explicit disallow lists for different versions of the C++ code around bloom filters. But for the part we care about it is similar to the java code. The main difference is that the check for ORC_135 does not look at the WriterImplementation/writer at all. It only looks at the writerVersion/capabilities number. So as long as we write a writerVersion that is at least 6, which is the one that we were assigned, then we are good to go. The main thing we need to do is to make sure that we don't have any of the bugs in our code that this writerVersion number is encoding, just because the C++ code will not handle it properly. Also because we just don't want bugs in our code.

Version of 0 is what we get today by not writing anything out. It says we have all kinds of bugs in our code and some features are not supported.
Version 1 says that HIVE-8732 was fixed. Which from the comments means that we write string min/max out as UTF-8, which we already do. Oddly no reader appears to check for this.
Version 2 says that HIVE-4243 was fixed, Which adds real column names to the ORC files. We already do this too, and there are no checks about this because it is more of a feature than a bug.
Version 3 is for HIVE-12055, which is for a vectorized writer, which is a new feature, but is used to check for a bug in bloom filter reads for string data. We don't do bloom filter writes so we don't really care about it.
Version 4 is for HIVE-13083 is a bug in writing out some decimals, but there is no check for it that I found, and it looks like there is no good way to work around the bug on a read.
Version 5 is for ORC-101 which normalizes string data for bloom filters to be UTF-8. Again no checks for this in any of the readers that I saw, and we don't do bloom filters.
Version 6 is the one we care about and is for ORC-135 where min/max timestamps are written out in UTC.
Version 7 is a fix for decimal64 min and max ORC-517 where if all the numbers are negative a 0 was stored as the max. I don't think we have that problem.
Versions 8 and 9 are features we don't support and I don't think we need to worry about.

So just from this it looks like we should be able to write out the writer/writerVersion info for CUDF and version 6 and get away with it. No need to worry about breaking existing readers. But if we want to run some tests I am happy to do that.

vuule · 2023-11-08T22:38:29Z

Thank you for the analysis @revans2!
We decided in an offline discussion to disable nanosecond statistics in 23.12 and look into writing the correct version starting from 24.02, so that we can re-enable nanoseconds.

vuule · 2023-11-08T22:39:51Z

@thirtiseven I've updated #14367 to exclude nanoseconds, your tests should be passing now; please verify and I'll make the PR ready for review.

thirtiseven · 2023-11-09T04:11:38Z

@vuule I'm afraid the push down tests still failed. Maybe it is blocked by the writer version issue?

vuule · 2023-11-09T04:52:45Z

In which way does it fail?

thirtiseven · 2023-11-09T05:05:10Z

The related test cases in spark-rapids failed in the same way as before, the results indicating that predicate push down is not happening when reading GPU files.

OrcFilterSuite:
- Support for pushing down filters for boolean types gpu write gpu read
- Support for pushing down filters for boolean types gpu write cpu read
- Support for pushing down filters for boolean types cpu write gpu read
- Support for pushing down filters for decimal types gpu write gpu read !!! CANCELED !!!
  https://github.com/rapidsai/cudf/issues/13933 (OrcFilterSuite.scala:78)
- Support for pushing down filters for decimal types gpu write cpu read !!! CANCELED !!!
  https://github.com/rapidsai/cudf/issues/13933 (OrcFilterSuite.scala:90)
- Support for pushing down filters for decimal types cpu write gpu read
- Support for pushing down filters for timestamp types cpu write gpu read
- Support for pushing down filters for timestamp types gpu write cpu read *** FAILED ***
  0 was less than 10, but 10 was not less than 10 (OrcFilterSuite.scala:37)
- Support for pushing down filters for timestamp types gpu write gpu read *** FAILED ***
  0 was less than 10, but 10 was not less than 10 (OrcFilterSuite.scala:37)

) Issue #14325 Use uint when reading/writing nano stats because nanoseconds have int32 encoding (different from both unit32 and sint32, _obviously_), which does not use zigzag. sint32 uses zigzag, and unit32 does not allow negative numbers, so we can use uint since we'll never have negative nanoseconds. Also disabled the nanoseconds because it should only be written after ORC-135; we don't write the version so readers get confused if nanoseconds are there. Planning to re-enable once we start writing the version. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #14367

vuule · 2023-11-20T23:24:46Z

Opened #14458 to include the writer code and the correct version.
@thirtiseven can you please test with this branch?

thirtiseven · 2023-11-21T03:03:23Z

My test complains that:

  java.io.IOException: file:/home/haoyangl/spark-rapids/tests/target/spark341/tmp/spark-test-6c5d822b-f6bf-42f5-a08b-524746fba019/part-00002-88031666-3d16-4756-8fba-68c42a781d26-c000.snappy.orc was written by a future ORC version 0.7. This file is not readable by this version of ORC.
Postscript: footerLength: 85 compression: SNAPPY compressionBlockSize: 262144 version: 0 version: 7 metadataLength: 47 magic: "ORC"
  at org.apache.orc.impl.ReaderImpl.checkOrcVersion(ReaderImpl.java:525)
  at org.apache.orc.impl.ReaderImpl.extractPostScript(ReaderImpl.java:645)
  at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:814)
  at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
  at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.$anonfun$readSchema$1(OrcUtils.scala:77)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2785)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.readSchema(OrcUtils.scala:77)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.$anonfun$readSchema$4(OrcUtils.scala:147)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  ...

It also fails many other ORC integration tests with a similar message.

vuule · 2023-11-21T03:28:37Z

Updated the branch to write 0.6 instead of 0.7. I think that's in line with reader's expectation.
Please try again if you get a chance.
I just don't understand how everything worked before, when we wrote 0.12.

thirtiseven · 2023-11-21T03:46:26Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] ORC writer produce wrong timestamp metrics which causes spark not to do predicate push down #14325

[BUG] ORC writer produce wrong timestamp metrics which causes spark not to do predicate push down #14325

thirtiseven commented Oct 25, 2023

sameerz commented Oct 30, 2023

thirtiseven commented Oct 31, 2023

vuule commented Nov 6, 2023 •

edited

Loading

vuule commented Nov 7, 2023

vuule commented Nov 7, 2023

thirtiseven commented Nov 7, 2023

vuule commented Nov 7, 2023

thirtiseven commented Nov 7, 2023 •

edited

Loading

vuule commented Nov 7, 2023

vuule commented Nov 7, 2023

thirtiseven commented Nov 8, 2023

revans2 commented Nov 8, 2023

revans2 commented Nov 8, 2023

vuule commented Nov 8, 2023

vuule commented Nov 8, 2023

thirtiseven commented Nov 9, 2023

vuule commented Nov 9, 2023

thirtiseven commented Nov 9, 2023 •

edited

Loading

vuule commented Nov 20, 2023

thirtiseven commented Nov 21, 2023

vuule commented Nov 21, 2023

thirtiseven commented Nov 21, 2023

vuule commented Nov 22, 2023

thirtiseven commented Nov 22, 2023

vuule commented Nov 22, 2023

[BUG] ORC writer produce wrong timestamp metrics which causes spark not to do predicate push down #14325

[BUG] ORC writer produce wrong timestamp metrics which causes spark not to do predicate push down #14325

Comments

thirtiseven commented Oct 25, 2023

sameerz commented Oct 30, 2023

thirtiseven commented Oct 31, 2023

vuule commented Nov 6, 2023 • edited Loading

vuule commented Nov 7, 2023

vuule commented Nov 7, 2023

thirtiseven commented Nov 7, 2023

vuule commented Nov 7, 2023

thirtiseven commented Nov 7, 2023 • edited Loading

vuule commented Nov 7, 2023

vuule commented Nov 7, 2023

thirtiseven commented Nov 8, 2023

revans2 commented Nov 8, 2023

revans2 commented Nov 8, 2023

vuule commented Nov 8, 2023

vuule commented Nov 8, 2023

thirtiseven commented Nov 9, 2023

vuule commented Nov 9, 2023

thirtiseven commented Nov 9, 2023 • edited Loading

vuule commented Nov 20, 2023

thirtiseven commented Nov 21, 2023

vuule commented Nov 21, 2023

thirtiseven commented Nov 21, 2023

vuule commented Nov 22, 2023

thirtiseven commented Nov 22, 2023

vuule commented Nov 22, 2023

vuule commented Nov 6, 2023 •

edited

Loading

thirtiseven commented Nov 7, 2023 •

edited

Loading

thirtiseven commented Nov 9, 2023 •

edited

Loading