Hive connector seems to change precision of timestamp. #5564
Comments
I think it was never supported. Do you have particular SQL commands in mind that used to work and no longer do?
In version 335: `SELECT to_unixtime(ts) as unix_s, ts FROM rtb.bid_requests where c='fr' and d='10' and y='2020' and h='10' limit 10;` Timestamps are in milliseconds. I'm tired, so it's not a matter of seconds, sorry. `ts` has datatype timestamp(3) in this version. In version 344 (I've finally set up another cluster), here are the results of the same query on the latest version: [...] `ts` also has datatype timestamp(3) in this version.
So to recap:
What file format are you using to store your data? This is probably related to #4974. It's possible we missed a code path for one of the formats.
We use Parquet files.
Would you mind posting the output of `parquet-tools meta`?
Hi, here is the result of `parquet-tools meta`: [...]
I think what's happening is that the [...]. A few possible ways to fix this: [...]
Timestamps are written by Spark as Long/BIGINT; the timestamp type is configured in Hive/Presto during table creation.
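For reference, a minimal PySpark sketch (column name, value, and path are illustrative, not taken from the original job) of how writing epoch milliseconds as a plain long ends up as an unannotated INT64 column in Parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bid-requests-writer").getOrCreate()

# Epoch milliseconds stored as a plain long; hypothetical value.
df = spark.createDataFrame([(1602324000123,)], ["ts"])

# Written like this, Parquet stores ts as unannotated INT64, so readers
# have no way to tell whether the values are seconds or milliseconds.
df.write.mode("overwrite").parquet("/tmp/bid_requests")
```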
I'm curious about how Hive interprets those values. Do you have a Hive installation that you could try with? In a nutshell, I'm trying to determine whether this has worked in the past by chance or whether interpreting those values as milliseconds is the expected behavior from Hive's perspective.
Our setup is really small since we mostly use only the metastore, but the result is the following: [...] Someone has the same issue here: [...] And it seems a bug is open against Hive: https://issues.apache.org/jira/browse/HIVE-15079
From the linked page (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps), it would appear that even the previous behavior of interpreting values as milliseconds was incorrect: [...]
I think Spark follows the default Java behavior of java.sql.Timestamp (https://docs.oracle.com/en/java/javase/11/docs/api/java.sql/java/sql/Timestamp.html): the long constructor takes milliseconds.
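To make the unit question concrete, a small Python illustration (the raw value is hypothetical, not from the reporter's data) of why the chosen interpretation matters:

```python
from datetime import datetime, timezone

raw = 1602324000123  # what java.sql.Timestamp(long) treats as epoch milliseconds

# Milliseconds interpretation: a plausible date.
print(datetime.fromtimestamp(raw / 1000, tz=timezone.utc))
# -> 2020-10-10 10:00:00.123000+00:00

# A seconds interpretation of the same raw value would land roughly
# 50,000 years in the future, so the unit changes everything.
```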
Exactly. That's what made me think, at first, that we write our timestamps in seconds, and that timestamp(0) should be supported to follow the Hive spec.
see also #5483 (comment)
I think some mappings in the Parquet writer work by chance only. I once tried to clean it up, but did not finish (WIP #2840).
So yes, it seems it works by chance only.
@luhhujbb, thanks for the research. It appears to me that the right course of action is to disallow that mapping. Unfortunately, it means you’d have to adjust your Spark processes to write the correct type with annotations.
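For anyone needing to do that, one possible writer-side adjustment is sketched below. This is only an illustration under assumptions (the path, column name, and the division by 1000 depend on how your data is stored): cast the long column to a real timestamp and ask Spark to emit annotated INT64 microsecond timestamps.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    # Emit annotated INT64 microsecond timestamps instead of INT96.
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)

df = spark.read.parquet("/tmp/bid_requests")
# Numeric-to-timestamp casts in Spark interpret the value as *seconds*,
# hence the division for an epoch-milliseconds column.
fixed = df.withColumn("ts", (F.col("ts") / 1000).cast("timestamp"))
fixed.write.mode("overwrite").parquet("/tmp/bid_requests_fixed")
```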
@martint I agree it will be cleaner that way. Many thanks.
We have also hit this issue when using the Presto Thrift connector. Our system (https://github.com/kelindar/talaria) is exporting timestamps as [...]
We narrowed it down to Presto and rolled back our Presto to version [...]
We got the same issue. We use PySpark 3 + Hive 2.6 + Parquet. We had two environments, version 333 and version 343. The timestamp column in version 333 is displayed as timestamp, but in 343 it is displayed as timestamp(3). Our Spark job uses the default timestamp format (yyyy-MM-dd HH:mm:ss) in Spark, and the Hive table uses the default timestamp format (yyyy-MM-dd HH:mm:ss) as well. I found that no timezone conversion function works in version 343 with the timestamp(3) column.
@kelindar can you please file a separate issue for this?
@MichaelZhao9824 at first glance, this seems only loosely related to the original problem reported here.
This issue comes from the following code path: [...]
Now all connectors internally convert all timestamps to microseconds.
@luhhujbb thanks for the update. Let's keep the issue open unless we decide there is no more work to be done. There is this idea open, for example: [...]
@luhhujbb do you have a fix?
@tooptoop4, we are converting our unannotated int64 Parquet fields to annotated int64 millisecond-timestamp Parquet fields in the existing data, and we are rewriting our Spark jobs to correctly write timestamps in Parquet. It's a bit painful, but we think it's the cleanest solution.
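If it helps anyone doing a similar migration, one way to verify that rewritten files actually carry the timestamp annotation is to inspect the Parquet schema, for example with pyarrow (the path and column name are hypothetical):

```python
import pyarrow.parquet as pq

schema = pq.read_schema("/tmp/bid_requests_fixed/part-00000.parquet")
# Prints e.g. timestamp[us] once the column is annotated, int64 before.
print(schema.field("ts").type)
```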
Presto version 350 seems to work with it.
Is there any solution? |
I'm not sure it's an issue; maybe it's written somewhere in the documentation, but I can't find it.
Presto version
All versions with the new variable-precision timestamp design
Issue
I have lots of tables with seconds-precision timestamps, i.e. timestamp(0).
As mentioned in #37 ("Typically databases store it internally as seconds since epoch in some fixed timezone (usually UTC)"), timestamps are usually stored in seconds, and precision is optional.
As mentioned here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types, the Hive timestamp type supports traditional UNIX timestamps with optional nanosecond precision. I don't know if that page is up to date with recent Hive versions.
So maybe I've missed something in the documentation, but I feel the Hive connector doesn't support plain timestamp, i.e. timestamp(0), anymore. Recent versions only support timestamp_millis (timestamp(3)) and timestamp_micros (timestamp(6)).
Is basic timestamp, i.e. timestamp(0), dropped from the Hive connector by design in recent versions, or is it a bug?
Many thanks!