Is your feature request related to a problem? Please describe.
The current Parquet reader depends heavily on the deprecated ConvertedType enum, as well as the associated (and also deprecated) scale and precision fields in the SchemaElement struct. Also, handling of some data types (such as string-to-category transformations performed on BYTE_ARRAY columns) relies on magic numbers rather than flags (see here and here for examples). Further, because the ConvertedType enum is no longer updated, the handling of certain types (e.g. nanosecond timestamps) currently examines both the converted type and the logical type, where simply using the logical type would suffice. New types added to the specification, such as UUID and FLOAT16, also require use of a logical type annotation for FIXED_LEN_BYTE_ARRAY encoded data.
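To illustrate the last point, here is a minimal Python sketch (not cuDF code) of why FIXED_LEN_BYTE_ARRAY data cannot be interpreted without its logical type annotation. The function name `decode_flba` and the string labels are hypothetical; the byte layouts assumed are the ones I understand the Parquet specification to define (FLOAT16 as a 2-byte little-endian IEEE half, UUID as 16 big-endian bytes).

```python
import struct
import uuid


def decode_flba(raw: bytes, logical_type=None):
    """Interpret one FIXED_LEN_BYTE_ARRAY value using its logical type.

    `logical_type` is a hypothetical string label standing in for the
    LogicalType annotation read from the schema.
    """
    if logical_type == "FLOAT16":
        if len(raw) != 2:
            raise ValueError("FLOAT16 requires a 2-byte value")
        # Assumed layout: IEEE 754 half precision, little-endian.
        return struct.unpack("<e", raw)[0]
    if logical_type == "UUID":
        if len(raw) != 16:
            raise ValueError("UUID requires a 16-byte value")
        # Assumed layout: 16 bytes in big-endian order.
        return uuid.UUID(bytes=raw)
    # Without an annotation the value is just opaque bytes.
    return raw
```

The same two bytes are either a half-precision float or an opaque blob depending solely on the annotation, and ConvertedType has no value that could carry this information.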
Describe the solution you'd like
I would like the Parquet decoder kernels to use Parquet physical and logical type data exclusively, rather than a hodgepodge of physical, converted, and logical types plus magic numbers. There are well-defined mappings of ConvertedTypes to LogicalTypes, so older files that lack LogicalType info can still be handled.
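As a rough sketch of what that up-front normalization could look like, the Python below maps a ConvertedType name to an equivalent logical type once, at schema-parsing time. The `LogicalType` dataclass and `logical_from_converted` are hypothetical names, not the reader's actual types. The mapping follows my reading of the backward-compatibility rules in the Parquet specification (legacy time and timestamp types map to UTC-adjusted logical types), and it assumes INTERVAL has no logical counterpart.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalType:
    """Hypothetical, flattened stand-in for the Parquet LogicalType union."""
    kind: str
    unit: Optional[str] = None
    is_adjusted_to_utc: Optional[bool] = None
    bit_width: Optional[int] = None
    is_signed: Optional[bool] = None
    precision: Optional[int] = None
    scale: Optional[int] = None


# Converted types that map to a parameterless logical type.
_SIMPLE = {"UTF8": "STRING", "MAP": "MAP", "LIST": "LIST", "ENUM": "ENUM",
           "DATE": "DATE", "JSON": "JSON", "BSON": "BSON"}
_TEMPORAL = {"TIME_MILLIS", "TIME_MICROS",
             "TIMESTAMP_MILLIS", "TIMESTAMP_MICROS"}
_INTEGERS = {f"{prefix}_{width}" for prefix in ("INT", "UINT")
             for width in (8, 16, 32, 64)}


def logical_from_converted(converted, precision=None, scale=None):
    """Return the logical type equivalent to a legacy ConvertedType name."""
    if converted in _SIMPLE:
        return LogicalType(_SIMPLE[converted])
    if converted == "DECIMAL":
        # precision/scale come from the deprecated SchemaElement fields.
        return LogicalType("DECIMAL", precision=precision, scale=scale)
    if converted in _TEMPORAL:
        kind, unit = converted.rsplit("_", 1)
        # Legacy time types are treated as UTC-adjusted.
        return LogicalType(kind, unit=unit, is_adjusted_to_utc=True)
    if converted in _INTEGERS:
        prefix, width = converted.split("_")
        return LogicalType("INT", bit_width=int(width),
                           is_signed=(prefix == "INT"))
    # None, INTERVAL, or anything unrecognized: no annotation (assumption).
    return None
```

With a step like this, the decode kernels would only ever see a physical type plus an optional logical type, and the deprecated fields would be read in exactly one place.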
Describe alternatives you've considered
We could make use of LogicalType info for new types only and keep the old ConvertedType logic.
Additional context
Implementing this change will require a great deal of care and testing to make sure there are no user-visible changes. For instance, the introduction of LogicalType had unintended consequences due to inconsistent prior handling of UTC conversion (#14322).