-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Repeated fields in Record APIs #2394
Comments
I'm struggling to reproduce this (see #2422), could you perhaps provide a code-sample of how you are reading this file? Are you perhaps using the lower level https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html interface, as if so you will need to perform record de-shredding yourself based on the returned repetition and definition level data. |
I'm using the Row reader api that should be able to read this as a list, but of course it can't because the design of the row reader is based on building a hierarchy of readers based on the hierarchical structure of a "Group" Parquet type in the file schema. But this case involves a "list" that isn't defined using a "Group" type, it is defined by a primitive type that has a "Repetition" of "Repeated" with appropriate repetition levels, this type of list encoding is valid per the Parquet format but not implemented by parquet-rs. |
I'm afraid I'm not very familiar with this API, it has effectively been orphaned since 2018, but I wasn't under the impression that it supported lists at all. Is this perhaps a feature request to add support for this? |
Is get_row_iter() deprecated? since when? |
Sorry my comment was unnecessarily inflammatory, it isn't deprecated because there isn't a drop-in replacement for it. However, I think it is fair to say it is not getting the same degree of care and attention as other parts of the codebase, in particular the arrow interface and the low-level APIs that builds upon. I would be extremely happy if you or someone else wanted to help maintain this part of the codebase, but I'm just trying to set expectations that at the moment it is effectively orphaned. |
I see, so which of the reader APIs are considered active? AFAIK the other reader APIs are:
|
The following is potentially somewhat subjective, so take with a grain of salt, but is I think fair
The low-level column API is still actively developed, in so much as the arrow internals make use of it. However, it is worth noting that they decode to their own buffer implementations instead of using
I presume you're referring to the file APIs here. If so these are still actively developed, as they are used by the arrow API without any major caveats when operating on local files.
The only high-level interface that I would describe as actively maintained is arrow, and is where most development effort is currently focused, with significant effort expended to make it fast, feature complete, and add advanced functionality such as predicate pushdown, async IO, etc... Whilst arrow may be a somewhat heavy dependency, there are ongoing improvements in this space, and I believe the additional performance, especially for dictionary encoded or variable length types, more than makes up for this. Perhaps we could add more feature flags to arrow-rs to reduce the size of it as a dependency, would it then work for your use-case? |
I'd think the Row level interface is central to the implementation, without it, it feels like this isn't really a proper parquet implementation library, rather a helper library mainly built to serve Arrow. Column reader and page reader (directly, not through Arrow) are also important. Arrow is well and fine, but Parquet is consumed by other in-memory columnar representations and other query engines as well. We completely disable the arrow feature when using the parquet crate. if parquet-rs design goals are specifically to serve Arrow, this should be clearly stated by the core team so folks taking dependency on it know what they're buying into. I'd think it would better serve the community to break parquet off of arrow-rs into a stand-alone project, arrow-rs can take a dependency on it. Thanks. |
There are no plans to remove the row-level APIs, in fact significant additional effort has been expended to preserve them, and we would welcome any contributions from the community to continue to improve them. 🙂 |
Thanks for raising this issue @zeevm -- While it is true that the current most active contributors to the parquet crate seem to be focused on the arrow usecase, I don't think it is accurate to say that the design goals are to serve arrow per se I think having someone such as yourself help design and contribute APIs (or docs or examples) that make it easier and clearer how to use The reason for the |
Another challenge with maintaining both on the same release schedule is that it isn't always clear when a major version bump -that technically should be a breaking change - is really a breaking change. e.g. when major version is increased because of breaking change in arrow only (and not in parquet), and parquet users are looking through code and docs to figure out what was the breaking change and how they should adjust their code to account for it. |
I agree. This is also an area that could use additional improvement and we currently are overly conservative with the major version increases |
FWIW @iajoiner also brought up the release recently on the mailing list schedule https://lists.apache.org/thread/v26dxfn4wx8q9slkb3f8pkmz0cggm1c3 |
A Parquet field with "Repeated" repetition and no "LIST" annotation are read as primitives instead of as list.
To reproduce: create a file with a top level field schema like:
REPEATED BYTE_ARRAY vals (UTF8);
and write lists of strings (i.e. with repetition levels of '0' and '1')
this should be read as a List of strings as specified in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types
Instead it is read as a field of single string values, where string comprising a logical list are instead read as distinct rows.
It is read correctly by pyarrow
The text was updated successfully, but these errors were encountered: