Persisting Arrow timestamps with Parquet produces missing TIMESTAMP in schema #1920

Closed
maxcountryman opened this issue Jun 21, 2022 · 2 comments
Labels
parquet (Changes to the parquet crate), question (Further information is requested)

Comments

@maxcountryman

Describe the bug
Fields represented in Arrow as Timestamp are not written to Parquet with a recognized TIMESTAMP annotation; they end up as plain int64 columns.

To Reproduce
I've written a simple utility for converting WARC files to Parquet. The Parquet it produces has a schema that looks something like this (note that date is a bare int64 with no TIMESTAMP annotation):

message arrow_schema {
  required binary id (STRING);
  required int32 content_length (INTEGER(32,false));
  required int64 date;
  required binary type (STRING);
  optional binary content_type (STRING);
  optional binary concurrent_to (STRING);
  optional binary block_digest (STRING);
  optional binary payload_digest (STRING);
  optional binary ip_address (STRING);
  optional binary refers_to (STRING);
  optional binary target_uri (STRING);
  optional binary truncated (STRING);
  optional binary warc_info_id (STRING);
  optional binary filename (STRING);
  optional binary profile (STRING);
  optional binary identified_payload_type (STRING);
  optional int32 segment_number (INTEGER(32,false));
  optional binary segment_origin_id (STRING);
  optional int32 segment_total_length (INTEGER(32,false));
  optional binary body;
}

Expected behavior
Parquet produced from a sample dataset (NYC taxi data) has correctly annotated TIMESTAMP columns:

message schema {
  optional binary hvfhs_license_num (STRING);
  optional binary dispatching_base_num (STRING);
  optional binary originating_base_num (STRING);
  optional int64 request_datetime (TIMESTAMP(MICROS,false));
  optional int64 on_scene_datetime (TIMESTAMP(MICROS,false));
  optional int64 pickup_datetime (TIMESTAMP(MICROS,false));
  optional int64 dropoff_datetime (TIMESTAMP(MICROS,false));
  optional int64 PULocationID;
  optional int64 DOLocationID;
  optional double trip_miles;
  optional int64 trip_time;
  optional double base_passenger_fare;
  optional double tolls;
  optional double bcf;
  optional double sales_tax;
  optional double congestion_surcharge;
  optional double airport_fee;
  optional double tips;
  optional double driver_pay;
  optional binary shared_request_flag (STRING);
  optional binary shared_match_flag (STRING);
  optional binary access_a_ride_flag (STRING);
  optional binary wav_request_flag (STRING);
  optional binary wav_match_flag (STRING);
}
@maxcountryman
Author

I believe I fixed this in this commit. However, it's not clear why milliseconds would work whereas seconds do not. Is this intended behavior?

@paddyhoran added the parquet label Jun 22, 2022
@tustvold
Contributor

tustvold commented Jun 23, 2022

Parquet does not have a way to represent timestamps with a base of seconds; its TIMESTAMP type only supports millisecond, microsecond and nanosecond units. As a result, if you write timestamps with seconds as a base, we can only embed this information in the Arrow schema and not in the Parquet schema.

See #1666 for more context, and discussion around an optional coerce_types feature that would automatically cast unsupported types for maximum compatibility.

If you wish to be compatible with non-Arrow Parquet readers, you will need to cast the Arrow array to a supported time base prior to writing it, or build the array using a supported time base, as your linked commit does.
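
As a rough illustration of the point above, here is a minimal, dependency-free sketch. The TimeUnit enum and the parquet_unit and seconds_to_millis helpers are hypothetical names made up for this example, not part of the arrow or parquet crates. In arrow-rs itself the rescaling would normally be done with the cast kernel (arrow::compute::cast) on the timestamp array before handing it to the writer. The sketch only shows which time bases have a Parquet TIMESTAMP unit and what a seconds-to-milliseconds cast amounts to per value.

```rust
/// Time bases an Arrow timestamp column can use (hypothetical stand-in type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Returns the Parquet TIMESTAMP unit matching an Arrow time base,
/// or None when Parquet has no equivalent (the seconds case).
fn parquet_unit(unit: TimeUnit) -> Option<&'static str> {
    match unit {
        TimeUnit::Second => None,
        TimeUnit::Millisecond => Some("MILLIS"),
        TimeUnit::Microsecond => Some("MICROS"),
        TimeUnit::Nanosecond => Some("NANOS"),
    }
}

/// Rescales second-based values to milliseconds.
/// Nulls stay null; the whole conversion yields None if any value overflows i64.
fn seconds_to_millis(values: &[Option<i64>]) -> Option<Vec<Option<i64>>> {
    values
        .iter()
        .map(|v| match v {
            Some(s) => s.checked_mul(1_000).map(Some),
            None => Some(None),
        })
        .collect()
}

fn main() {
    // A nullable "date" column stored as seconds since the Unix epoch.
    let date_seconds: Vec<Option<i64>> = vec![Some(1_655_769_600), None];

    // Seconds have no Parquet unit, so the column would be written as plain int64.
    assert_eq!(parquet_unit(TimeUnit::Second), None);

    // After rescaling, the column can carry a TIMESTAMP(MILLIS, ...) annotation.
    let date_millis = seconds_to_millis(&date_seconds).expect("timestamp overflow");
    println!("{:?} -> {:?}", parquet_unit(TimeUnit::Millisecond), date_millis);
}
```

The overflow check matters mostly for the finer units: multiplying by 1,000 rarely overflows, but the same approach with nanoseconds can for dates far from the epoch.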

@tustvold added the question label and removed the bug label Jun 23, 2022