
Add Option To Coerce Map Type on Parquet Write #6213

Closed
ion-elgreco opened this issue Aug 8, 2024 · 2 comments · Fixed by #6828
Labels: bug, enhancement (Any new improvement worthy of an entry in the changelog)

Comments

@ion-elgreco

Describe the bug
Creating a RecordBatch with Arrow map types yields field names that differ from what the Parquet spec requires. When you write a Parquet file with DataFusion, the Parquet spec is simply ignored and the data is written as-is, keeping the Arrow field names in the Parquet file, which violates the Parquet spec.

The parquet file has this schema:

<pyarrow._parquet.ParquetSchema object at 0x7f5393f683c0>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 entries {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

instead of

<pyarrow._parquet.ParquetSchema object at 0x7f5393f9cd40>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

The PyArrow Parquet writer doesn't do this; it follows the Parquet spec when writing. See here:

import pyarrow as pa
import pyarrow.parquet as pq

pylist = [{"map_type":{'1':b"M"}}]
schema = pa.schema(
    [
        pa.field("map_type", pa.map_(pa.large_string(), pa.large_binary())),
    ]
)
table = pa.Table.from_pylist(pylist, schema=schema)

# table.schema
#
# map_type: map<large_string, large_binary>
#   child 0, entries: struct<key: large_string not null, value: large_binary> not null
#       child 0, key: large_string not null
#       child 1, value: large_binary

pq.write_table(table, "test.parquet")
pq.read_metadata("test.parquet").schema

<pyarrow._parquet.ParquetSchema object at 0x7f53cc5116c0>
required group field_id=-1 schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

You can see that entries was properly written as key_value. It is also interesting to note that PyArrow uses "key"/"value", while arrow-rs uses "keys"/"values".

@ion-elgreco ion-elgreco added the bug label Aug 8, 2024
@ion-elgreco ion-elgreco changed the title Why does datafusion/arrow-rs write Map Arrow types as-is without taking Parquet spec into account? Arrow/parquet-rs writes map types as-is without taking Parquet spec into account Aug 8, 2024
@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Aug 9, 2024
@alamb alamb changed the title Arrow/parquet-rs writes map types as-is without taking Parquet spec into account arrow/parquet-rs writes MapArray types as-is without taking Parquet spec into account Aug 9, 2024
@alamb
Contributor

alamb commented Aug 9, 2024

I marked this as an enhancement (rather than a bug) but the distinction is likely not all that useful

It would be great to have the ArrowWriter / Reader follow the same convention as pyarrow when reading/writing maps (or the standard, if there is a standard that addresses this particular point)

@tustvold
Contributor

This boils down to the same issue as #6733, namely that arrow has different naming conventions than parquet. As stated on that linked ticket, the first step would be to add an option to coerce the schema on write; once that is added we can discuss changing the default, but it must remain possible to keep the current behaviour.
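The coercion described above amounts to renaming the map's intermediate fields on write. A minimal Python sketch of that renaming, following the MAP layout in the Parquet spec (the helper and mapping names here are hypothetical illustrations, not the arrow-rs API; the actual option was added by the linked fix PR):

```python
# Hypothetical sketch (not the arrow-rs API) of the field renaming a
# "coerce on write" option would apply. The Parquet spec's MAP layout is:
#
#   repeated group key_value {
#     required <key-type> key;
#     <value-repetition> <value-type> value;
#   }

# Arrow-side names observed in this issue mapped to Parquet-spec names.
ARROW_TO_PARQUET_MAP_NAMES = {
    "entries": "key_value",  # the repeated group
    "keys": "key",           # arrow-rs key field convention
    "values": "value",       # arrow-rs value field convention
}

def coerce_map_field_name(name: str) -> str:
    """Return the Parquet-spec name for an Arrow map-layout field name,
    leaving any other field name unchanged."""
    return ARROW_TO_PARQUET_MAP_NAMES.get(name, name)

print(coerce_map_field_name("entries"))  # key_value
print(coerce_map_field_name("keys"))     # key
print(coerce_map_field_name("other"))    # other
```

A real implementation would apply this recursively over the schema while leaving the array data untouched, since only the names (not the layout) differ.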

@tustvold tustvold changed the title arrow/parquet-rs writes MapArray types as-is without taking Parquet spec into account Add Option To Coerce Map Type on Parquet Write Nov 29, 2024