
Add Option To Coerce Map Type on Parquet Write #6213

Closed
ion-elgreco opened this issue Aug 8, 2024 · 2 comments · Fixed by #6828
Labels: bug, enhancement (Any new improvement worthy of an entry in the changelog)

Comments

@ion-elgreco

Describe the bug
Creating a RecordBatch with Arrow map types yields field names that differ from what the Parquet spec requires. When you write a Parquet file with DataFusion, the Parquet spec is simply ignored and the data is written as-is, keeping the Arrow field names in the Parquet file, which violates the Parquet spec.

The parquet file has this schema:

<pyarrow._parquet.ParquetSchema object at 0x7f5393f683c0>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 entries {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

instead of

<pyarrow._parquet.ParquetSchema object at 0x7f5393f9cd40>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

The PyArrow Parquet writer doesn't do this; it follows the Parquet spec when writing. See here:

import pyarrow as pa
import pyarrow.parquet as pq

pylist = [{"map_type":{'1':b"M"}}]
schema = pa.schema(
    [
        pa.field("map_type", pa.map_(pa.large_string(), pa.large_binary())),
    ]
)
table = pa.Table.from_pylist(pylist, schema=schema)

# table.schema
#
# map_type: map<large_string, large_binary>
#   child 0, entries: struct<key: large_string not null, value: large_binary> not null
#       child 0, key: large_string not null
#       child 1, value: large_binary

pq.write_table(table, "test.parquet")
pq.read_metadata("test.parquet").schema

<pyarrow._parquet.ParquetSchema object at 0x7f53cc5116c0>
required group field_id=-1 schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

You can see that entries was properly written as key_value. It is also interesting to note that PyArrow uses "key"/"value", while arrow-rs uses "keys"/"values".

@ion-elgreco ion-elgreco added the bug label Aug 8, 2024
@ion-elgreco ion-elgreco changed the title Why does datafusion/arrow-rs write Map Arrow types as-is without taking Parquet spec into account? Arrow/parquet-rs writes map types as-is without taking Parquet spec into account Aug 8, 2024
@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Aug 9, 2024
@alamb alamb changed the title Arrow/parquet-rs writes map types as-is without taking Parquet spec into account arrow/parquet-rs writes MapArray types as-is without taking Parquet spec into account Aug 9, 2024
@alamb
Contributor

alamb commented Aug 9, 2024

I marked this as an enhancement (rather than a bug) but the distinction is likely not all that useful

It would be great to have the ArrowWriter / Reader follow the same convention as pyarrow when reading/writing maps (or the standard, if there is a standard that addresses this particular point)

@tustvold
Contributor

This boils down to the same issue as #6733, namely that arrow has different naming conventions than parquet. As stated on that linked ticket, the first step would be to add an option to coerce the schema on write; once that is added we can discuss changing the default, but it must remain possible to keep the current behaviour.
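The coercion described above amounts to renaming the map's intermediate fields on write. A minimal Python sketch of that renaming, following the MAP layout in the Parquet spec (the helper and mapping names here are hypothetical illustrations, not the arrow-rs API; the actual option was added by the linked fix PR):

```python
# Hypothetical sketch (not the arrow-rs API) of the field renaming a
# "coerce on write" option would apply. The Parquet spec's MAP layout is:
#
#   repeated group key_value {
#     required <key-type> key;
#     <value-repetition> <value-type> value;
#   }

# Arrow-side names observed in this issue mapped to Parquet-spec names.
ARROW_TO_PARQUET_MAP_NAMES = {
    "entries": "key_value",  # the repeated group
    "keys": "key",           # arrow-rs key field convention
    "values": "value",       # arrow-rs value field convention
}

def coerce_map_field_name(name: str) -> str:
    """Return the Parquet-spec name for an Arrow map-layout field name,
    leaving any other field name unchanged."""
    return ARROW_TO_PARQUET_MAP_NAMES.get(name, name)

print(coerce_map_field_name("entries"))  # key_value
print(coerce_map_field_name("keys"))     # key
print(coerce_map_field_name("other"))    # other
```

A real implementation would apply this recursively over the schema while leaving the array data untouched, since only the names (not the layout) differ.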

@tustvold tustvold changed the title arrow/parquet-rs writes MapArray types as-is without taking Parquet spec into account Add Option To Coerce Map Type on Parquet Write Nov 29, 2024