Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Support DurationType in writing/reading parquet #23117

Closed
asfimport opened this issue Oct 3, 2019 · 5 comments
Closed

[C++][Parquet] Support DurationType in writing/reading parquet #23117

asfimport opened this issue Oct 3, 2019 · 5 comments

Comments

@asfimport
Copy link
Collaborator

Currently this is not supported:

In [37]: table = pa.table({'a': pa.array([1, 2], pa.duration('s'))}) 

In [39]: table
Out[39]: 
pyarrow.Table
a: duration[s]

In [41]: pq.write_table(table, 'test_duration.parquet')
...
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: duration[s]

There is no direct mapping to Parquet logical types. There is an INTERVAL type, but this more matches Arrow's ( YEAR_MONTH or DAY_TIME) interval type.

But, those duration values could be stored as just integers, and based on the serialized arrow schema, it could be restored when reading back in.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-6780. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Neville Dipale / @nevi-me:
Hi @jorisvandenbossche, I'm having a similar issue/dilemma on the Rust side.

Given that we serialize the Arrow schema and store it in the Parquet metadata, it becomes easier to write intervals as FixedLenBinary. On the read side, we take guidance from the Arrow schema on which IntervalUnit to use.

The problem comes if we read an interval without an Arrow schema. I think it'd be the same with the Duration type.

I've looked at various JIRAs here, and saw that Pandas stores Intervals as an extension array with nested storage (https://issues.apache.org/jira/browse/ARROW-9078).

Given that the Duration type is not composite, how about we store it as an INT32 or INT64 depending on the resolution, then rely on ARROW::schema to roundtrip it correctly? CC @emkornfield  as you've recently worked on this part of the C++ impl.

@asfimport
Copy link
Collaborator Author

Micah Kornfield / @emkornfield:
For duration I like int64 + arrow schema for round tripping.  we might want to add some extra metadata to indicate it is a duration separately (I need to review the parquet specification to see what is allowed in this area).

@asfimport
Copy link
Collaborator Author

Jorge Leitão / @jorgecarleitao:
I do not think extra metadata is needed: store them as i64, and load them using the arrow schema seems reasonable: the schema contains the time unit, which is sufficient to guarantee a roundtrip.

@asfimport
Copy link
Collaborator Author

P:
  @jorisvandenbossche / @jorgecarleitao / @emkornfield : any planned movement on this issue? Coming from the Pandas side, it's quite inconvenient having to special-case types handled by Pandas but not by Arrow/Parquet.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 12449
#12449

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants