Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of duplicate columns in pandas.io.sql.read_frame #2738

Closed
eingerman opened this issue Jan 23, 2013 · 9 comments
Closed

Handling of duplicate columns in pandas.io.sql.read_frame #2738

eingerman opened this issue Jan 23, 2013 · 9 comments
Labels
Enhancement IO Data IO issues that don't fit into a more specific label IO SQL to_sql, read_sql, read_sql_query
Milestone

Comments

@eingerman
Copy link

Calling pandas.io.sql.read_frame can results in data frame with duplicate column names. For example when SQL query contains joins on tables with duplicate columns.

Data frames with duplicate column names cause errors in many pandas functions. I can't even rename columns as df.columns = new_columns generates errors.

I think correct behavior would be for pandas.io.sql.read_frame have an option to "deduplicate" column names (for example by adding a number) or generate an error with duplicate column names.

@ghost
Copy link

ghost commented Jan 23, 2013

There were a lot of dupe col bugs fixed in 9.1 (or 9.0, can't remember), so make sure you're using
the latest version.

as for df.columns = new_columns not working, I get this with with git master:

In [19]: df=mkdf(2,4)

In [20]: df
Out[20]: 
C0      C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
R0                                     
R_l0_g0    R0C0    R0C1    R0C2    R0C3
R_l0_g1    R1C0    R1C1    R1C2    R1C3

In [21]: df.columns=["a","a","b","c"]

In [22]: df
Out[22]: 
            a     a     b     c
R0                             
R_l0_g0  R0C0  R0C1  R0C2  R0C3
R_l0_g1  R1C0  R1C1  R1C2  R1C3

In [23]: df.columns=["a","d","b","c"]

In [24]: df
Out[24]: 
            a     d     b     c
R0                             
R_l0_g0  R0C0  R0C1  R0C2  R0C3
R_l0_g1  R1C0  R1C1  R1C2  R1C3

if you're not getting the same behaviour on a recent version, please open an issue with
steps to reproduce and it'll be looked into.

@hayd
Copy link
Contributor

hayd commented Jul 8, 2013

Is this now fixed? (at least the behaviour after it's been read in via sql)?

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

dups are pretty good in master now....

@hayd
Copy link
Contributor

hayd commented Jul 8, 2013

also, not sure it makes sense (in general) to dedupe pre-pandas...

@hayd hayd closed this as completed Jul 8, 2013
@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

maybe add an option to the sql engine like mangle_dup_columns (like on read_csv), so dupped would be dupped (or like X.1, X.2)...etc... but definitly shouldn't de-dup before pandas

@hayd hayd reopened this Jul 8, 2013
@hayd hayd mentioned this issue Jul 8, 2013
20 tasks
@hayd
Copy link
Contributor

hayd commented Jul 8, 2013

Is there already a method to do that, which is just be applied after reading?
Seems pretty trivial.

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

I would close this and just add to the master list...(but lower down)

@hayd
Copy link
Contributor

hayd commented Jul 8, 2013

already added

@ghost
Copy link

ghost commented Jan 24, 2014

@hayd , I'm pushing to coalesce the bits and pieces of SQL around @mangecoeur recent work
in #5950. #3163 already mentions this, and we should rework the outsanding issues there
as a continuation issue for #5950.

closing.

@ghost ghost closed this as completed Jan 24, 2014
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Data IO issues that don't fit into a more specific label IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

No branches or pull requests

3 participants