[FEA] Parquet reader should use LogicalType rather than ConvertedType #15224

etseidl · 2024-03-04T18:48:55Z

Is your feature request related to a problem? Please describe.
The current Parquet reader depends heavily on the deprecated ConvertedType enum, as well as the associated (and also deprecated) scale and precision fields in the SchemaElement struct. Handling of some data types (such as the string-to-category transformations performed on BYTE_ARRAY columns) also relies on magic numbers rather than flags (see here and here for examples). Further, because the ConvertedType enum is no longer updated, the handling of certain types (e.g. nanosecond timestamps) currently examines both the converted type and the logical type, where simply using the logical type would suffice. New types added to the specification, such as UUID and FLOAT16, also require use of a logical type annotation for FIXED_LEN_BYTE_ARRAY encoded data.

Describe the solution you'd like
I would like the Parquet decoder kernels to use Parquet physical and logical type data exclusively, rather than a hodgepodge of physical, converted, and logical types plus magic numbers. There are well-defined mappings of ConvertedTypes to LogicalTypes, so older files that lack LogicalType info can still be handled.
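As a rough illustration of the mapping mentioned above, here is a minimal Python sketch of how a legacy ConvertedType could be translated into a LogicalType, following the backward-compatibility rules in the Parquet format's logical types documentation. The `LogicalType` dataclass and the `converted_to_logical` function are hypothetical names invented for this example; they are not libcudf APIs, and the actual reader is C++/CUDA.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalType:
    """Hypothetical stand-in for a Parquet LogicalType annotation."""
    kind: str
    unit: Optional[str] = None
    is_adjusted_to_utc: Optional[bool] = None
    bit_width: Optional[int] = None
    is_signed: Optional[bool] = None
    precision: Optional[int] = None
    scale: Optional[int] = None


# Converted types that map to a parameterless logical type.
_SIMPLE = {
    "UTF8": "STRING",
    "MAP": "MAP",
    "MAP_KEY_VALUE": "MAP",
    "LIST": "LIST",
    "ENUM": "ENUM",
    "DATE": "DATE",
    "JSON": "JSON",
    "BSON": "BSON",
}

# INT_8..INT_64 and UINT_8..UINT_64 map to INT(bitWidth, isSigned).
_INTEGERS = {
    f"{prefix}_{width}": (width, prefix == "INT")
    for prefix in ("INT", "UINT")
    for width in (8, 16, 32, 64)
}

# Legacy time types are treated as UTC-normalized; note there is no NANOS entry.
_TEMPORAL = {
    "TIME_MILLIS": ("TIME", "MILLIS"),
    "TIME_MICROS": ("TIME", "MICROS"),
    "TIMESTAMP_MILLIS": ("TIMESTAMP", "MILLIS"),
    "TIMESTAMP_MICROS": ("TIMESTAMP", "MICROS"),
}


def converted_to_logical(converted_type, precision=None, scale=None):
    """Return the LogicalType equivalent of a ConvertedType name, or None."""
    if converted_type in _SIMPLE:
        return LogicalType(_SIMPLE[converted_type])
    if converted_type == "DECIMAL":
        # Precision and scale come from the deprecated SchemaElement fields.
        return LogicalType("DECIMAL", precision=precision, scale=scale)
    if converted_type in _TEMPORAL:
        kind, unit = _TEMPORAL[converted_type]
        return LogicalType(kind, unit=unit, is_adjusted_to_utc=True)
    if converted_type in _INTEGERS:
        width, signed = _INTEGERS[converted_type]
        return LogicalType("INT", bit_width=width, is_signed=signed)
    # No annotation, or a converted type with no logical equivalent (INTERVAL).
    return None
```

With a translation like this applied once when the file schema is read, downstream code would only need to inspect the physical type and the logical type. The UTC assumption for legacy time and timestamp types is the kind of detail that needs careful testing, as noted under "Additional context".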

Describe alternatives you've considered
We could make use of LogicalType info for new types only and keep the old ConvertedType logic.

Additional context
Implementing this change will require a great deal of care and testing to make sure there are no user-visible changes. For instance, the introduction of LogicalType had unintended consequences due to inconsistent prior handling of UTC conversion (#14322).

Closes #15224. Now use logical type exclusively in the reader rather than the deprecated converted type.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- MithunR (https://github.com/mythrocks)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #15365

etseidl added the "feature request" label Mar 4, 2024

GregoryKimball added this to libcudf Mar 6, 2024

GregoryKimball added this to the Parquet continuous improvement milestone Mar 6, 2024

GregoryKimball added the "libcudf" and "cuIO" labels Mar 6, 2024

GregoryKimball moved this to Needs owner in libcudf Mar 6, 2024

etseidl mentioned this issue Mar 21, 2024

Use logical types in Parquet reader #15365

Merged

3 tasks

mhaseeb123 mentioned this issue Mar 22, 2024

[BUG] Unable to write timedelta64[s] type correctly with parquet writer #13409

Closed

rapids-bot bot closed this as completed in #15365 Mar 27, 2024
