[BUG] Orc test failure (test_orc_write_statistics) #7314
I still see local failures on these tests with the latest branch-22.02.
For example:

```python
>>> actual_max
numpy.datetime64('2001-09-09T01:55:11.684000000')
>>> stats_max
datetime.datetime(2001, 9, 9, 0, 55, 11, 684000, tzinfo=datetime.timezone.utc)
```

I wonder if this is a DST issue or something weird like that. I see the same "off by one hour" bug in other comparisons.
```python
>>> expect.min()
_c0   1679-07-22 14:26:48.910277376
dtype: datetime64[ns]
>>> expect.loc[(expect != got).values].min()
_c0   1918-04-12 16:21:19.290896768
dtype: datetime64[ns]
```

🤔
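A quick way to quantify the mismatch (a sketch; it assumes `expect` and `got` are the `datetime64[ns]` frames above, with the single column `_c0`):

```python
import pandas as pd

# Sketch: if the "off by one hour" theory is right, every non-NaT mismatch
# between `expect` and `got` should differ by exactly one hour.
diff = (expect["_c0"] - got["_c0"]).dropna()
print(diff[diff != pd.Timedelta(0)].unique())
```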
I can confirm that all the timestamps that pass the equality check are before 1918 OR (after 1918 AND in months when DST is not active). The timestamps that fail the equality check are all (after 1918 AND in months when DST is active).

Passing:

```python
>>> got.loc[(expect == got).values].head(20)
                              _c0
0   1726-12-14 05:28:27.950277376
2   1772-10-25 04:57:04.040828992
3   2240-12-06 07:10:00.416241920
7   2043-12-05 05:41:45.725587072
12  2202-11-28 10:44:24.912241920
13  1836-09-30 14:26:55.601483840
14  2221-12-16 02:19:01.204587072
17  1875-02-07 23:34:10.171932224
18  1701-08-10 02:37:27.225483840
19  1756-09-18 23:13:14.606828992
21  1947-04-05 14:43:22.362448384
22  2193-11-03 19:31:02.534345152
23  2071-12-11 15:06:56.374896768
24  2198-12-27 14:37:49.142448384
25  2176-02-28 10:25:24.113828992
27  2228-02-22 14:51:58.464793536
29  2028-11-18 14:42:40.120345152
33  1791-04-04 20:47:05.787793536
34  1697-11-17 00:14:11.477828992
39  1768-05-05 17:35:22.858828992
```

Failing (ignore the `NaT`s):

```python
>>> got.loc[(expect != got).values].head(20)
                              _c0
1   2212-05-13 03:48:53.347138688
4   2164-10-05 18:55:36.266345152
5   2047-05-09 18:40:33.687896768
6   2085-09-08 08:24:20.249380608
8                             NaT
9   2111-10-25 12:44:13.615896768
10  2043-03-11 09:41:44.077448384
11  1992-06-08 16:43:58.214241920
15  1943-11-21 00:31:30.609587072
16  2217-04-21 22:12:28.090448384
20  2053-08-25 04:43:17.684932224
26  2242-05-13 08:58:41.036138688
28  2015-04-14 04:38:16.134241920
30                            NaT
31  2197-07-10 15:46:38.144793536
32  2072-08-14 01:35:57.770380608
35  1951-06-18 22:07:41.504793536
36  2178-04-17 12:26:21.656483840
37  2134-10-24 07:17:33.455896768
38  2101-04-28 05:54:35.066448384
```

One possible exception:
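A way to check that hypothesis directly (a sketch; it assumes the local zone is America/Chicago and that `expect`/`got` are the single-column `_c0` frames above):

```python
from zoneinfo import ZoneInfo
import pandas as pd

chicago = ZoneInfo("America/Chicago")

def dst_active(ts: pd.Timestamp) -> bool:
    # True if DST is in effect in America/Chicago at this wall-clock instant
    # (nanoseconds are dropped, which is fine for this check).
    return bool(ts.to_pydatetime().replace(tzinfo=chicago).dst())

mask_pass = expect["_c0"] == got["_c0"]
passing = got["_c0"][mask_pass].dropna()
failing = got["_c0"][~mask_pass].dropna()
print(failing.map(dst_active).all())  # hypothesis: True (all failing rows are in DST months)
print(passing.map(dst_active).any())  # hypothesis: False (no passing row is in a DST month)
```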
I discussed this with @vuule today. I have produced a minimal test case that disagrees between PyArrow's file reader and cuDF's:

```python
import cudf
import datetime
import pandas as pd
import pyarrow
import pyarrow.orc

# Write an ORC file with a timestamp using cuDF and PyArrow
pdf = pd.DataFrame(
    {"dst_timestamp": [pd.Timestamp("1981-05-18 21:00:08.262378")]}
)
cudf.DataFrame(pdf).to_orc("dst_timestamp_cudf.orc")
pyarrow.orc.write_table(
    pyarrow.Table.from_pandas(pdf), "dst_timestamp_pyarrow.orc"
)

# Read each file with PyArrow and cuDF
for filename in ("dst_timestamp_cudf.orc", "dst_timestamp_pyarrow.orc"):
    orcfile = pyarrow.orc.ORCFile(filename)
    pdf = cudf.DataFrame(orcfile.read().to_pandas())
    print(f"PyArrow reading {filename}")
    print(pdf)
    gdf = cudf.read_orc(filename)
    print(f"cuDF reading {filename}")
    print(gdf)
    print(f"Difference (PyArrow - cuDF), {filename}")
    print(pdf - gdf)
    print()

# I have confirmed that PyArrow and cudf agree if the system is in the UTC time
# zone, but not if the system is in a time zone that observes DST like
# America/Chicago (Central, which is currently CST).
print("Current timezone:")
print(datetime.datetime.now().astimezone().tzinfo)
```

Output:
From this, we can narrow it down to a specific case: PyArrow does not read the correct data from a file written by cuDF. I have a hypothesis about the reason for this, from earlier discussion with @vuule. Using the ORC tools, I get the following:

```
$ orc-contents dst_timestamp_cudf.orc
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
$ orc-contents dst_timestamp_pyarrow.orc
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
```

The above snippet shows that the data in the two files matches. The metadata, however, differs:

```
$ orc-metadata dst_timestamp_cudf.orc
{ "name": "dst_timestamp_cudf.orc",
"type": "struct<dst_timestamp:timestamp>",
"attributes": {},
"rows": 1,
"stripe count": 1,
"format": "0.12", "writer version": "original", "software version": "ORC Java",
"compression": "none",
"file length": 192,
"content": 75, "stripe stats": 26, "footer": 73, "postscript": 17,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 1,
"offset": 3, "length": 72,
"index": 12, "data": 15, "footer": 45
}
]
}
```

```
$ orc-metadata dst_timestamp_pyarrow.orc
{ "name": "dst_timestamp_pyarrow.orc",
"type": "struct<dst_timestamp:timestamp>",
"attributes": {},
"rows": 1,
"stripe count": 1,
"format": "0.12", "writer version": "ORC-135", "software version": "ORC C++ 1.7.1",
"compression": "zlib", "compression block": 65536,
"file length": 294,
"content": 130, "stripe stats": 37, "footer": 100, "postscript": 23,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 1,
"offset": 3, "length": 130,
"index": 53, "data": 27, "footer": 50
}
]
}
```

Notably, the file written by pyarrow says `"writer version": "ORC-135"`, whereas the file written by cuDF says `"writer version": "original"`. The ORCv1 spec is here and notes:
Here's the issue for ORC-135. I hope this is helpful diagnostic information - I am not yet sure where to look next. I am also unsure about whether to treat this as a bug in cuDF's metadata writing, or a bug in pyarrow's metadata reading.
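For what it's worth, the one-hour size of the shift matches the DST offset of the zone mentioned above. A minimal sketch (assuming the affected machine is in America/Chicago):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

chicago = ZoneInfo("America/Chicago")
# The problematic timestamp falls in a DST month; compare its local UTC offset
# with the same wall-clock time in January.
may = datetime(1981, 5, 18, 21, 0, 8, tzinfo=chicago)
jan = datetime(1981, 1, 18, 21, 0, 8, tzinfo=chicago)
print(may.utcoffset())  # -1 day, 19:00:00 -> UTC-5 (CDT)
print(jan.utcoffset())  # -1 day, 18:00:00 -> UTC-6 (CST)
```

So any reader-side adjustment that depends on the local rules at the timestamp's date would be off by exactly one hour whenever DST is active, which matches the failing rows above.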
Thanks @bdice!
@vuule I tried setting the writer version; the commit is here. Perhaps this is a bug in PyArrow? At least, I think the next step to resolve this will involve reading the PyArrow source.
Went over the minimal repro code - statistics are not read, so the writer version should not affect the output. From this code it looks like pyarrow is affected by the current time zone in the same way for both statistics and column data. IMO this means that the issue is unrelated to statistics/writer version.
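One way to confirm that the reader follows the process time zone (a sketch; it assumes the ORC reader consults the zone at read time; if the zone is cached when the library loads, each case needs a fresh process):

```python
import os
import time
import pyarrow.orc

# Read the cuDF-written file under two different process time zones and see
# whether the decoded timestamp changes with the zone.
for tz in ("UTC", "America/Chicago"):
    os.environ["TZ"] = tz
    time.tzset()  # POSIX only
    table = pyarrow.orc.ORCFile("dst_timestamp_cudf.orc").read()
    print(tz, table.column("dst_timestamp")[0])
```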
I've found that when PyArrow reads the ORC file on my system ("GMT" timezone ID) the time is advanced by exactly one hour for a datetime on 1970-01-01 and one hour and one second for a datetime on 1969-12-31 - this is using values from the
I also noticed that the writer is setting the timezone of the writer in the stripe footer as UTC regardless of what the current timezone is (and the reader also assumes that it is UTC, so cudf is consistent with itself). If I make the following change (a quick hack, because my local timezone ID is GMT):

```diff
diff --git a/cpp/src/io/orc/writer_impl.cu b/cpp/src/io/orc/writer_impl.cu
index b0e674c206..c69099d8ce 100644
--- a/cpp/src/io/orc/writer_impl.cu
+++ b/cpp/src/io/orc/writer_impl.cu
@@ -1963,7 +1963,7 @@ void writer::impl::write(table_view const& table)
         (sf.columns[i].kind == DICTIONARY_V2)
           ? orc_table.column(i - 1).host_stripe_dict(stripe_id)->num_strings
           : 0;
-      if (orc_table.column(i - 1).orc_kind() == TIMESTAMP) { sf.writerTimezone = "UTC"; }
+      if (orc_table.column(i - 1).orc_kind() == TIMESTAMP) { sf.writerTimezone = "GMT"; }
     }
     buffer_.resize((compression_kind_ != NONE) ? 3 : 0);
     pbw_.write(sf);
```

then all the ORC tests pass. Prior to making this change, I was seeing:
I did hypothesise that PyArrow was doing something wrong, but I noticed that
If I do things the other way and write out an ORC file using Pandas / PyArrow, then read it in to cudf, I get wrong timestamps in cudf:

```python
import cudf
import pyarrow as pa
import pyarrow.orc
import pandas as pd
from io import BytesIO

buffer = BytesIO()
s = pd.Series([710424008, -1338482640], dtype="datetime64[ns]")
df = pd.DataFrame({"s": s})
table = pa.Table.from_pandas(df)

writer = pa.orc.ORCWriter('arrow_dates.orc')
writer.write(table)
writer.close()

with open('arrow_dates.orc', 'rb') as f:
    cudf_got = cudf.read_orc(f)
with open('arrow_dates.orc', 'rb') as f:
    pyarrow_got = pa.orc.ORCFile(f).read()

print(cudf_got)
print(pyarrow_got.column(0))
```

gives:
I presume the date in 2043 came from some wraparound with nanoseconds (the test the values are from is called
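For reference, the raw values in that repro sit just on either side of the Unix epoch, so any mishandling of negative nanosecond values would show up here (a quick sketch):

```python
import pandas as pd

# The two raw values, interpreted as nanoseconds since the Unix epoch,
# straddle 1970-01-01 by roughly a second on each side.
print(pd.Series([710424008, -1338482640], dtype="datetime64[ns]"))
# 0   1970-01-01 00:00:00.710424008
# 1   1969-12-31 23:59:58.661517360
```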
Also, with my patch from the previous comment, the only difference this makes to the written ORC file is to change "UTC" to "GMT", but unfortunately only
Here is a zip file containing two ORC files and the minimal Python script to produce them. I can reproduce this failure with `orc-contents`:

```
$ orc-contents dst_timestamp_pyarrow.orc # Expected result, written by PyArrow
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
$ orc-contents dst_timestamp_cudf.orc # Undesired result, written by cuDF
{"dst_timestamp": "1981-05-18 22:00:08.262378"}
```

Changing any of the following appears to hide the problem:

I have also eliminated the possibility that this is PyArrow's fault (Arrow dynamically links to the ORC C++ library). Next, there are two immediate things to try:
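A quick programmatic version of the same check (a sketch; it assumes the two files from the attached zip are in the working directory and the system is one that shows the failure):

```python
import pyarrow.orc

# Read both files with PyArrow and compare the single timestamp they contain.
good = pyarrow.orc.ORCFile("dst_timestamp_pyarrow.orc").read()["dst_timestamp"][0]
bad = pyarrow.orc.ORCFile("dst_timestamp_cudf.orc").read()["dst_timestamp"][0]
print(bad.as_py() - good.as_py())  # expected: exactly one hour on an affected system
```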
The ORC library reads the writer time zone from the system time zone database. I tried printing the timezone's information after it is fetched here:

```cpp
std::cout << "Getting writer timezone..." << std::endl;
writerTimezone.print(std::cout);
```

I got the following:
I conclude that there's something going wrong in the way rapids-compose attempts to match the host time zone through mounting the host's time zone files into the container.
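A quick way to see what the container actually ends up with (a sketch; the file locations are the usual Linux ones and may not match every image):

```python
import datetime
import time
from pathlib import Path

# Compare what the process believes the local zone is with what the mounted
# files say; a mismatch would point at the rapids-compose mounting theory.
print("time.tzname:   ", time.tzname)
print("local tzinfo:  ", datetime.datetime.now().astimezone().tzinfo)
tzfile = Path("/etc/timezone")
print("/etc/timezone: ", tzfile.read_text().strip() if tzfile.exists() else "<missing>")
localtime = Path("/etc/localtime")
print("/etc/localtime:", localtime.resolve() if localtime.exists() else "<missing>")
```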
@bdice is this reproducible in contexts outside of compose? IIRC we need a specific container configuration to observe this, right? I haven't seen this in a long time (it doesn't show up in devcontainers), but I don't recall the severity of the underlying issue (is it data corruption?) to know whether it's important enough to keep open.
I haven’t seen this outside of compose. I do not see this with devcontainers. It’s possibly some weird behavior involving time zone support in Docker. I’m happy to close this.
I also don't see this in devcontainers. Glad it's not showing up anymore, even if it's not clear exactly why!
The following tests fail reliably for me:
Stack trace: