[BUG] ORC read/write is incompatible with Spark in some corner case(s) #11525
Other data that can reproduce the issue:
@ttnghia can you add the output resulting from these values?
If these timestamps are written in Spark CPU, the timestamps read back in cudf are always
Thanks @ttnghia, it looks like we found the same issue from the python side with
++ does not seem to affect parquet 🤔
This error appears to affect all timestamps before the start of the UTC epoch in 1970 and with microsecond count between 0 and 1000.
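For reference, a small sketch (mine, not from the thread) that constructs timestamps in the range described: pre-1970 values whose microsecond field is below 1000. The value reported in this issue, with its `.000826` fraction, falls squarely in that range:

```python
import datetime

# Pre-epoch timestamps whose microsecond field is in [0, 1000),
# the range this comment identifies as affected.
base = datetime.datetime(1969, 12, 31, 23, 59, 59)
cases = [base.replace(microsecond=us) for us in (0, 1, 499, 826, 999)]
for ts in cases:
    print(ts.isoformat(sep=" "))
```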
Probably related: #5529 (comment)
Looks like the comment above is not directly related to the root cause.
Fixes #11525

Contains a chain of fixes:
1. Allow negative nanoseconds in negative timestamps - aligns the writer with pyorc;
2. Limit the seconds adjustment to positive nanoseconds - fixes the off-by-one issue reported in #11525;
3. Fix the decode of large uint64_t values (> max `int64_t`) - fixes reading of cuDF-encoded negative nanoseconds;
4. Avoid mode 2 encode when the base value is larger than max `int64_t` - follows the specs and fixes reading of negative nanoseconds with non-cuDF readers.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Bradley Dice (https://github.com/bdice)

URL: #11586
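For illustration, a minimal Python sketch (mine; the actual fix lives in libcudf's C++ ORC writer) of the two ways a pre-epoch timestamp can be split into the seconds/nanoseconds pair ORC stores. Per the PR notes above, the fixed writer emits negative nanoseconds, matching pyorc:

```python
# Hypothetical pre-epoch value: -1.000826 s relative to the writer's epoch.
total_ns = -1_000_826_000

# Floor split: non-negative nanoseconds, seconds rounded down.
sec_floor, ns_floor = divmod(total_ns, 1_000_000_000)
print(sec_floor, ns_floor)  # -2 999174000

# Truncation split: negative nanoseconds alongside negative seconds,
# the convention pyorc uses and the fixed cuDF writer now emits.
sec_trunc = sec_floor + (1 if ns_floor else 0)
ns_trunc = total_ns - sec_trunc * 1_000_000_000
print(sec_trunc, ns_trunc)  # -1 -826000
```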
Sorry guys. Our tests just discovered new failing cases for these values:
I'm not sure if this is still the old bug or a new one. Edit: Added more cases.
What are the timestamps in the comment? Two cases that fail, the correct timestamp and the cudf result, or something else?
These values, if written by Spark CPU and then read with cudf, will produce different values (off by one second). For example:
Note that
No repro so far; all three timestamps are read correctly with cuDF. @ttnghia, can you please share more detailed repro instructions?
Reproducing: Write in Spark (
Read in cudf:
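The exact commands were lost in the formatting above, so here is a hedged sketch of that round trip; the paths and session setup are my assumptions, not the original repro:

```python
import glob

import cudf
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# Write the problematic value with Spark CPU; the path is an assumption.
df = spark.createDataFrame([("1839-12-24 03:58:55.000826",)], ["s"])
df.select(to_timestamp("s").alias("ts")) \
  .coalesce(1).write.mode("overwrite").orc("/tmp/ts_repro")

# Read the Spark-written part file back with cudf and compare.
part = glob.glob("/tmp/ts_repro/part-*.orc")[0]
print(cudf.read_orc(part))
```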
…or (#11699)

Closes #11525

Not sure why, but the apache Java ORC reader does the following when reading negative timestamps: https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1284-L1285

This detail does not impact the cuDF and pyorc writers (reading cuDF files with the apache reader already works) because these libraries write negative timestamps with negative nanoseconds.

This PR modifies the ORC reader behavior to match the apache reader, so that cuDF correctly reads ORC files written by the apache writer.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Bradley Dice (https://github.com/bdice)
- Mike Wilson (https://github.com/hyperbolic2346)
- Elias Stehle (https://github.com/elstehle)
- Nghia Truong (https://github.com/ttnghia)

URL: #11699
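As a rough illustration (my paraphrase of the linked Java lines, not a verbatim port): when the stored seconds are negative but the nanosecond field is positive, the reader shifts the seconds down by one so the nanoseconds count forward from the earlier second boundary.

```python
# Hedged paraphrase of the adjustment described above; not a verbatim
# port of the linked Apache ORC Java code.
def decode_timestamp_ns(seconds: int, nanos: int) -> int:
    if seconds < 0 and nanos > 0:
        seconds -= 1  # shift down so positive nanos count forward
    return seconds * 1_000_000_000 + nanos

# -1 s with 999_174_000 ns decodes to -1.000826 s overall.
print(decode_timestamp_ns(-1, 999_174_000))  # -1000826000
```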
When reading/writing timestamp data in the ORC file format, the following cases were discovered. For an input column with one row of timestamp type, `1839-12-24 03:58:55.000826`:

Write in Spark + read in libcudf:

Write in libcudf + read in Spark:

In particular, the `second` value is wrong. The input `second` is `55`, but it is messed up somehow.

Of course, for that input timestamp value, the results of both writing and reading in libcudf are the same as the input.
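For context, a small sketch (mine, and assuming the ORC encoding detail that timestamps are stored as seconds relative to 2015-01-01 plus nanoseconds) computing the stored pair for this value; the `.000826` fraction survives as a sub-millisecond nanosecond remainder, which matches the affected range identified earlier in the thread:

```python
from datetime import datetime

# Assumption: ORC stores timestamps as seconds relative to the
# 2015-01-01 base plus a nanosecond field.
ORC_EPOCH = datetime(2015, 1, 1)
ts = datetime(1839, 12, 24, 3, 58, 55, 826)

delta = ts - ORC_EPOCH
total_ns = ((delta.days * 86_400 + delta.seconds) * 1_000_000
            + delta.microseconds) * 1_000

# A floor split keeps the fraction as a positive sub-millisecond remainder.
sec, ns = divmod(total_ns, 1_000_000_000)
print(sec, ns)  # ns == 826_000, i.e. the .000826 fraction
```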