DateTime writer support #108

Open
liamlundy opened this issue Nov 13, 2020 · 4 comments
Comments

@liamlundy

The README mentions that the writer does not yet support a few types, including date-like types. I didn't see any issue tracking this, so I'm opening one to track its status.

Is this planned to be supported soon? If not, what needs to happen in order to support this? I might be able to create a PR at some point.

@xiaodaigh
Contributor

Hmm, I think you can look at the writer.jl file. It would also be good to link the relevant DateTime page from the Parquet format documentation.

Can you write some Parquet files with datetime columns in Python or R and provide a few simple files for testing?

@liamlundy
Author

Okay, I was able to create some example files using pyarrow. I've included the code I used to generate them below.

Link to Apache Parquet docs about the Date / Time Logical Types: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

I'm not sure how soon I'll be able to dig into this, but I'll leave this info here for myself or anyone else that wants to take a crack at it in the meantime. It looks like there is also some work to be done to support reading a few of the date / time types as well.
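For anyone picking this up, here is a rough sketch of the integer encodings the linked LogicalTypes document describes: DATE is stored as an int32 count of days since 1970-01-01, while TIMESTAMP and TIME are stored as integer counts of the chosen unit (since the Unix epoch and since midnight, respectively). The helper names below are hypothetical and only use the Python standard library; this is not how Parquet.jl or pyarrow implement it.

```python
from datetime import date, datetime, time

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1)

# Hypothetical lookup table: how many of each unit fit in one second.
UNITS_PER_SECOND = {"ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def encode_date(d: date) -> int:
    """DATE: number of days since 1970-01-01 (stored as int32)."""
    return (d - EPOCH_DATE).days


def encode_timestamp(dt: datetime, unit: str = "us") -> int:
    """TIMESTAMP: count of `unit` since the epoch for a naive datetime."""
    delta = dt - EPOCH_DATETIME
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    # Integer arithmetic only, so large values do not lose precision.
    return micros * UNITS_PER_SECOND[unit] // 1_000_000


def encode_time(t: time, unit: str = "us") -> int:
    """TIME: count of `unit` since midnight."""
    micros = ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond
    return micros * UNITS_PER_SECOND[unit] // 1_000_000


if __name__ == "__main__":
    first_week = datetime(2000, 1, 2)  # first Sunday on or after 2000-01-01
    print(encode_date(first_week.date()))
    print(encode_timestamp(first_week, "ms"))
    print(encode_time(time(1, 0), "ns"))
```

A writer would still need to attach the matching logical type annotation to the column metadata; the sketch only covers turning values into the integers that get stored.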

Python script for generating parquet files with datetime columns:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


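if __name__ == "__main__":
    # 26 weekly timestamps for the timestamp and date columns,
    # 26 hourly timestamps for the time column.
    weeks = pd.date_range(start="2000-01-01", periods=26, freq="W")
    hours = pd.date_range(start="2000-01-01", periods=26, freq="H")
    data = pd.DataFrame(
        {
            "ns": list(weeks),
            "ms": list(weeks),
            "us": list(weeks),
            "date": weeks.date,
            "time": hours.time,
        }
    )

    # Explicit Arrow schema so that each column gets a specific type and time unit.
    schema = pa.schema(
        [
            pa.field("ns", pa.timestamp("ns")),
            pa.field("ms", pa.timestamp("ms")),
            pa.field("us", pa.timestamp("us")),
            pa.field("date", pa.date64()),
            pa.field("time", pa.time64("ns")),
        ]
    )

    table = pa.Table.from_pandas(data, schema=schema)

    # Write the same table with both Parquet format versions. The version is
    # passed explicitly for v1 so the result does not depend on the pyarrow default.
    pq.write_table(table, "example-v1.parquet", version="1.0")
    pq.write_table(table, "example-v2.parquet", version="2.0")

    # Print both Parquet schemas to compare the logical types that were written.
    v1_file = pq.ParquetFile("example-v1.parquet")
    v2_file = pq.ParquetFile("example-v2.parquet")
    print(v1_file.schema)
    print(v2_file.schema)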

Output:

<pyarrow._parquet.ParquetSchema object at 0x10f054c80>
required group field_id=0 schema {
  optional int64 field_id=1 ns (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=2 ms (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=3 us (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 time (Time(isAdjustedToUTC=true, timeUnit=nanoseconds));
}

<pyarrow._parquet.ParquetSchema object at 0x10f05d690>
required group field_id=0 schema {
  optional int64 field_id=1 ns (Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=2 ms (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=3 us (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 time (Time(isAdjustedToUTC=true, timeUnit=nanoseconds));
}
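On the reading side, the `timeUnit` shown in the schemas above determines how the stored integers should be interpreted. Here is a minimal sketch of that decoding, assuming the reader has already extracted the raw integer and the unit name from the column metadata. The function names are hypothetical, and because Python's `datetime` only has microsecond resolution, nanosecond values are truncated here.

```python
from datetime import date, datetime, timedelta

EPOCH_DATETIME = datetime(1970, 1, 1)
EPOCH_DATE = date(1970, 1, 1)

# Unit names as they appear in the printed schema (timeUnit=...).
NANOS_PER_UNIT = {"milliseconds": 1_000_000, "microseconds": 1_000, "nanoseconds": 1}


def decode_timestamp(value: int, time_unit: str) -> datetime:
    """Turn a stored integer into a naive datetime (sub-microsecond part dropped)."""
    nanos = value * NANOS_PER_UNIT[time_unit]
    return EPOCH_DATETIME + timedelta(microseconds=nanos // 1_000)


def decode_date(days: int) -> date:
    """Turn a stored day count into a date."""
    return EPOCH_DATE + timedelta(days=days)


if __name__ == "__main__":
    # The same instant stored in the "ms" and "ns" columns decodes identically.
    print(decode_timestamp(946771200000, "milliseconds"))
    print(decode_timestamp(946771200000000000, "nanoseconds"))
    print(decode_date(10958))
```

Note that in the v1 file the `ns` column is reported with `timeUnit=microseconds`, so a reader has to trust the unit in the file metadata rather than the column name.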

@the-noble-argon

Being able to write datetimes is crucial for a lot of data science applications. I'm still forced to call Python for this, which makes it very hard to scale any time-series Julia solution that has to interact with other components that speak Parquet.

@xiaodaigh
Contributor

I wonder if the Parquet2.jl implementation solves this?
