[C++][Parquet] Support DurationType in writing/reading parquet #23117

asfimport · 2019-10-03T13:51:11Z

Currently this is not supported:

In [37]: table = pa.table({'a': pa.array([1, 2], pa.duration('s'))}) 

In [39]: table
Out[39]: 
pyarrow.Table
a: duration[s]

In [41]: pq.write_table(table, 'test_duration.parquet')
...
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: duration[s]

There is no direct mapping to Parquet logical types. Parquet has an INTERVAL type, but it corresponds more closely to Arrow's interval types (YEAR_MONTH or DAY_TIME).

However, those duration values could be stored as plain integers and restored on read, based on the serialized Arrow schema.
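To illustrate the idea, here is a minimal sketch that uses only the Python standard library. It is not the Arrow or Parquet implementation: `FakeFile`, `write_durations`, `read_column`, and the `"schema"` metadata key are all hypothetical names invented for this example. It only shows the principle of writing the values as integers next to a serialized schema and consulting that schema on read.

```python
import json
from dataclasses import dataclass, field


# Hypothetical stand-in for a Parquet file: physical integer columns plus
# free-form key-value metadata. This is not a pyarrow or Parquet API.
@dataclass
class FakeFile:
    columns: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)


def write_durations(name, values, unit):
    """Store duration values as plain integers and record the logical type."""
    schema = {name: f"duration[{unit}]"}
    return FakeFile(
        columns={name: list(values)},
        metadata={"schema": json.dumps(schema)},  # assumed metadata key
    )


def read_column(f, name):
    """Return (logical_type, values); fall back to int64 without a schema."""
    schema = json.loads(f.metadata.get("schema", "{}"))
    return schema.get(name, "int64"), f.columns[name]


f = write_durations("a", [1, 2], "s")
restored = read_column(f, "a")

# A reader that does not see the schema metadata only gets integers back.
stripped = FakeFile(columns=f.columns)
fallback = read_column(stripped, "a")
```

With the metadata present, the column comes back tagged as `duration[s]`. Without it, the same values are indistinguishable from an ordinary integer column.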

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

GitHub Pull Request #12449

Note: This issue was originally created as ARROW-6780. Please see the migration documentation for further details.

asfimport · 2020-12-19T15:52:24Z

Neville Dipale / @nevi-me:
Hi @jorisvandenbossche, I'm having a similar issue/dilemma on the Rust side.

Given that we serialize the Arrow schema and store it in the Parquet metadata, it becomes easier to write intervals as FixedLenBinary. On the read side, we take guidance from the Arrow schema on which IntervalUnit to use.

The problem comes if we read an interval without an Arrow schema. I think it'd be the same with the Duration type.

I've looked at various JIRAs here, and saw that Pandas stores Intervals as an extension array with nested storage (https://issues.apache.org/jira/browse/ARROW-9078).

Given that the Duration type is not composite, how about we store it as an INT32 or INT64 depending on the resolution, then rely on ARROW:schema to roundtrip it correctly? CC @emkornfield as you've recently worked on this part of the C++ impl.

asfimport · 2021-03-01T05:15:28Z

Micah Kornfield / @emkornfield:
For duration, I like int64 + Arrow schema for round-tripping. We might want to add some extra metadata to indicate separately that it is a duration (I need to review the Parquet specification to see what is allowed in this area).

asfimport · 2021-06-29T18:41:53Z

Jorge Leitão / @jorgecarleitao:
I do not think extra metadata is needed. Storing them as i64 and loading them using the Arrow schema seems reasonable: the schema contains the time unit, which is sufficient to guarantee a roundtrip.
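A small standard-library sketch of why the time unit is the one piece of information the reader needs: the same stored integer denotes different durations under different units. The `interpret` helper and the unit strings are assumptions made for this illustration, not part of any Arrow API, and nanoseconds are left out because `datetime.timedelta` only has microsecond resolution.

```python
from datetime import timedelta

# Map of unit abbreviations (assumed here to mirror Arrow's 's', 'ms', 'us')
# to the matching timedelta keyword argument.
_KWARG = {"s": "seconds", "ms": "milliseconds", "us": "microseconds"}


def interpret(stored, unit):
    """Interpret a stored integer as a duration in the given unit."""
    return timedelta(**{_KWARG[unit]: stored})


raw = 90_000  # the integer as it would sit in the int64 column
as_ms = interpret(raw, "ms")  # 90 seconds
as_us = interpret(raw, "us")  # 90 milliseconds
```

The integer alone cannot distinguish the two readings. Once the unit is known from the schema, the value is fully determined.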

asfimport · 2022-01-07T19:18:45Z

P:
@jorisvandenbossche / @jorgecarleitao / @emkornfield : any planned movement on this issue? Coming from the Pandas side, it's quite inconvenient having to special-case types handled by Pandas but not by Arrow/Parquet.

asfimport · 2022-02-17T15:30:10Z

Antoine Pitrou / @pitrou:
Issue resolved by pull request #12449

asfimport closed this as completed Feb 17, 2022

asfimport assigned jorisvandenbossche Jan 10, 2023

This was referenced Mar 23, 2024

[BUG] Unable to write timedelta64[s] type correctly with parquet writer rapidsai/cudf#13409

Closed

[BUG] Parquet reader unable to read duration types written by pyarrow rapidsai/cudf#13410

Closed
