Describe the bug
Creating a RecordBatch with arrow map types produces different field names than the parquet spec requires. When you write a parquet file with datafusion, the parquet spec is simply ignored and the data is written as-is, with the same field names carried into the parquet file. This violates the parquet spec.
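For context, a minimal reproduction sketch in Rust, assuming arrow-rs's `MapBuilder` defaults and the parquet crate's `ArrowWriter` (the column name `map_type` and the output path are illustrative):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::builder::{MapBuilder, StringBuilder};
use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter; // parquet crate with the "arrow" feature

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a one-row map column {"a": "b"}; MapBuilder's default field
    // names are arrow's "entries"/"keys"/"values", not the parquet spec's
    // "key_value"/"key"/"value"
    let mut builder = MapBuilder::new(None, StringBuilder::new(), StringBuilder::new());
    builder.keys().append_value("a");
    builder.values().append_value("b");
    builder.append(true)?;
    let map = builder.finish();

    let batch = RecordBatch::try_from_iter([("map_type", Arc::new(map) as _)])?;

    // The writer emits the arrow field names unchanged
    let file = File::create("map.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```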
The parquet file has this schema:
```
<pyarrow._parquet.ParquetSchema object at 0x7f5393f683c0>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 entries {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}
```
instead of
```
<pyarrow._parquet.ParquetSchema object at 0x7f5393f9cd40>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}
```
The pyarrow parquet writer doesn't do this and follows the parquet spec when writing. See here: entries got written as key_value properly. Also interesting to note: PyArrow uses "key"/"value" for the entry fields, while arrow-rs uses "keys"/"values".
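On the arrow-rs side, the inner field names can also be set to match the spec when the array is built; a minimal sketch, assuming arrow-rs's `MapFieldNames` (the column name `map_type` is just illustrative):

```rust
use std::sync::Arc;

use arrow_array::builder::{MapBuilder, MapFieldNames, StringBuilder};
use arrow_array::RecordBatch;

fn spec_named_map_batch() -> Result<RecordBatch, Box<dyn std::error::Error>> {
    // Use the parquet spec's names instead of arrow's defaults
    // ("entries"/"keys"/"values")
    let names = MapFieldNames {
        entry: "key_value".to_string(),
        key: "key".to_string(),
        value: "value".to_string(),
    };
    let mut builder = MapBuilder::new(Some(names), StringBuilder::new(), StringBuilder::new());
    builder.keys().append_value("a");
    builder.values().append_value("b");
    builder.append(true)?;
    let map = builder.finish();

    // Writing this batch as-is now yields spec-compliant field names
    Ok(RecordBatch::try_from_iter([("map_type", Arc::new(map) as _)])?)
}
```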
ion-elgreco changed the title from "Why does datafusion/arrow-rs write Map Arrow types as-is without taking Parquet spec into account?" to "Arrow/parquet-rs writes map types as-is without taking Parquet spec into account" on Aug 8, 2024
alamb changed the title from "Arrow/parquet-rs writes map types as-is without taking Parquet spec into account" to "arrow/parquet-rs writes MapArray types as-is without taking Parquet spec into account" on Aug 9, 2024
I marked this as an enhancement (rather than a bug), but the distinction is likely not all that useful.
It would be great to have the ArrowWriter / Reader follow the same convention as pyarrow when reading/writing maps (or the standard, if there is a standard that addresses this particular point).
This boils down to the same issue as #6733, namely that arrow has different naming conventions from parquet. As stated on that linked ticket, the first step would be to add an option to coerce the schema on write; once that is added we can have discussions about changing this default, but it must remain possible to keep the current behaviour.
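A sketch of what opting in could look like at the call site, assuming a writer property along the lines of the coerce-on-write option discussed on that ticket (the property name here is an assumption, not necessarily the shipped API):

```rust
use std::fs::File;

use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn write_coerced(batch: &RecordBatch) -> Result<(), Box<dyn std::error::Error>> {
    // Assumed opt-in: rewrite arrow's map field names ("entries"/"keys"/
    // "values") to the parquet spec's "key_value"/"key"/"value" on write
    let props = WriterProperties::builder()
        .set_coerce_types(true) // assumption: the released name may differ
        .build();
    let file = File::create("map_coerced.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(batch)?;
    writer.close()?;
    Ok(())
}
```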
tustvold changed the title from "arrow/parquet-rs writes MapArray types as-is without taking Parquet spec into account" to "Add Option To Coerce Map Type on Parquet Write" on Nov 29, 2024