from_arrow handles empty/duplicate column names badly #11632

Open
stinodego opened this issue Oct 10, 2023 · 0 comments
Labels
A-interop-arrow Area: interoperability with other Arrow implementations (such as pyarrow) bug Something isn't working P-low Priority: low python Related to Python Polars

stinodego commented Oct 10, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pyarrow as pa
import polars as pl

a1 = pa.array([1, 2])
a2 = pa.array([3, 4])


df = pl.from_arrow(pa.Table.from_arrays([a1, a2], names=["", ""]))
print(df.columns)  # ['column_0', 'column_1']

df = pl.from_arrow(pa.Table.from_arrays([a1, a2], names=["", "a"]))
print(df.columns)  # ['column_0', 'a']

df = pl.from_arrow(pa.Table.from_arrays([a1, a2], names=["a", "a"]))
print(df.columns)  # ['a']

Log output

No response

Issue description

pyarrow Tables allow duplicate column names, while we do not.

We handle empty column names by replacing them with column_0, column_1, etc.
For duplicate column names, we silently drop the duplicates, keeping only the last one.

Expected behavior

I propose we raise an error on duplicate column names, with the suggestion to specify the schema argument. Then we can treat empty column names ("") the same as any other column name.
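
A minimal sketch of the proposed check, in plain Python so it does not depend on any particular Polars or pyarrow version. The helper name `check_unique_column_names` and the error wording are assumptions made for illustration; they are not part of the Polars API.

```python
from collections import Counter


def check_unique_column_names(names):
    """Raise ValueError if any column name occurs more than once.

    Hypothetical helper: it only illustrates the proposed behaviour and
    is not an existing Polars function.
    """
    counts = Counter(names)
    # Collect every name that appears more than once, in first-seen order.
    duplicates = [name for name, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(
            f"duplicate column names found: {duplicates!r}; "
            "specify the schema argument to provide unique names"
        )
    # Empty names get no special treatment: they pass through unchanged.
    return list(names)
```

Under this sketch, `["", "a"]` is accepted as-is, while both `["a", "a"]` and `["", ""]` raise, because an empty name is treated like any other name.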

Installed versions

main branch, pyarrow 13.0.0

@stinodego stinodego added bug Something isn't working python Related to Python Polars labels Oct 10, 2023
@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@stinodego stinodego added A-interop-arrow Area: interoperability with other Arrow implementations (such as pyarrow) P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Feb 18, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Feb 18, 2024
wence- added a commit to wence-/cudf that referenced this issue Aug 19, 2024
polars.from_arrow renames empty column names (see
pola-rs/polars#11632). This causes
problems when round-tripping specially crafted dataframes. Avoid the
problem by constructing the table with fake names and then renaming.
wence- added a commit to rapidsai/cudf that referenced this issue Aug 27, 2024
polars.from_arrow renames empty column names (see
pola-rs/polars#11632). This causes problems
when round-tripping specially crafted dataframes. Avoid the problem by
constructing the table with fake names and then renaming.
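
The "fake names, then rename" workaround from the commit message can be sketched roughly as follows. This is not the code from the referenced cudf commit; the helper name `placeholder_names` and the `__tmp_col_` prefix are hypothetical, and the helper covers only the name bookkeeping.

```python
def placeholder_names(names, prefix="__tmp_col_"):
    """Return (fake_names, restore_map) for a list of real column names.

    Hypothetical helper: the fake names are unique and non-empty by
    construction, so a conversion step has no reason to rewrite them.
    """
    fake = [f"{prefix}{i}" for i in range(len(names))]
    # Map each placeholder back to the real name it stands in for.
    restore = dict(zip(fake, names))
    return fake, restore
```

The intended use is to build the pyarrow table with `fake` as its column names, convert it with `pl.from_arrow`, and then call `rename(restore)` on the resulting DataFrame. This assumes the real names are unique, since Polars does not allow duplicate column names.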