Handling of duplicate columns in pandas.io.sql.read_frame #2738

eingerman · 2013-01-23T18:42:41Z

Calling pandas.io.sql.read_frame can result in a data frame with duplicate column names, for example when the SQL query contains joins on tables with duplicate columns.

Data frames with duplicate column names cause errors in many pandas functions. I can't even rename columns as df.columns = new_columns generates errors.

I think the correct behavior would be for pandas.io.sql.read_frame to have an option to "deduplicate" column names (for example by adding a number) or to raise an error on duplicate column names.
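
For reference, here is a minimal sketch of how a join produces the duplicate names in the first place. It uses only the standard-library sqlite3 driver rather than read_frame, and the tables `a` and `b` and their columns are made up for illustration:

```python
import sqlite3

# Hypothetical in-memory tables that share the column names "id" and "name".
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'x');
    INSERT INTO b VALUES (1, 'y');
    """
)

# SELECT * over a join returns every column of both tables,
# so the shared names appear twice in the result set.
cur = conn.execute("SELECT * FROM a JOIN b ON a.id = b.id")
column_names = [d[0] for d in cur.description]
rows = cur.fetchall()
conn.close()

print(column_names)
print(rows)
```

Any reader that builds a frame from the cursor description, as read_frame presumably does, inherits these repeated names unless the query aliases them.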

ghost · 2013-01-23T19:28:26Z

There were a lot of dupe col bugs fixed in 0.9.1 (or 0.9.0, can't remember), so make sure you're using
the latest version.

As for df.columns = new_columns not working, I get this with git master:

In [19]: df=mkdf(2,4)

In [20]: df
Out[20]: 
C0      C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
R0                                     
R_l0_g0    R0C0    R0C1    R0C2    R0C3
R_l0_g1    R1C0    R1C1    R1C2    R1C3

In [21]: df.columns=["a","a","b","c"]

In [22]: df
Out[22]: 
            a     a     b     c
R0                             
R_l0_g0  R0C0  R0C1  R0C2  R0C3
R_l0_g1  R1C0  R1C1  R1C2  R1C3

In [23]: df.columns=["a","d","b","c"]

In [24]: df
Out[24]: 
            a     d     b     c
R0                             
R_l0_g0  R0C0  R0C1  R0C2  R0C3
R_l0_g1  R1C0  R1C1  R1C2  R1C3

if you're not getting the same behaviour on a recent version, please open an issue with
steps to reproduce and it'll be looked into.

hayd · 2013-07-08T20:18:15Z

Is this now fixed (at least the behaviour after it's been read in via sql)?

jreback · 2013-07-08T20:19:24Z

dups are pretty good in master now....

hayd · 2013-07-08T20:25:11Z

also, not sure it makes sense (in general) to dedupe pre-pandas...

jreback · 2013-07-08T20:26:51Z

maybe add an option to the sql engine like mangle_dup_columns (like on read_csv), so dupped would be dupped (or like X.1, X.2)...etc... but definitely shouldn't de-dup before pandas

hayd · 2013-07-08T20:32:46Z

Is there already a method to do that, which could just be applied after reading?
Seems pretty trivial.
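
A rough sketch of what such a post-read step could look like, following the X, X.1, X.2 naming mentioned above. The helper name `mangle_dup_names` is hypothetical and not a pandas function; it works on a plain list of names, so it makes no assumptions about pandas internals:

```python
def mangle_dup_names(names):
    """Return a list where repeated names get a numeric suffix (X, X.1, X.2, ...).

    Hypothetical helper for illustration only; not part of pandas.
    """
    names = list(names)
    used = set(names)   # names already present must not be produced again
    counts = {}         # next suffix to try for each repeated name
    result = []
    for name in names:
        n = counts.get(name, 0)
        if n == 0:
            # First occurrence keeps its original name.
            counts[name] = 1
            result.append(name)
            continue
        candidate = f"{name}.{n}"
        # Skip suffixes that would collide with an existing column name.
        while candidate in used:
            n += 1
            candidate = f"{name}.{n}"
        used.add(candidate)
        counts[name] = n + 1
        result.append(candidate)
    return result


print(mangle_dup_names(["a", "a", "b", "a"]))
```

Assuming column assignment works as in the session shown earlier in this thread, this could be applied after reading with something like df.columns = mangle_dup_names(df.columns).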

jreback · 2013-07-08T20:42:05Z

I would close this and just add to the master list...(but lower down)

hayd · 2013-07-08T20:46:31Z

already added

ghost · 2014-01-24T14:35:14Z

@hayd, I'm pushing to coalesce the bits and pieces of SQL around @mangecoeur's recent work
in #5950. #3163 already mentions this, and we should rework the outstanding issues there
as a continuation issue for #5950.

closing.

hayd closed this as completed Jul 8, 2013

hayd reopened this Jul 8, 2013

hayd mentioned this issue Jul 8, 2013: ENH: sql support #4163 (closed, 20 tasks)

ghost closed this as completed Jan 24, 2014

jreback mentioned this issue Apr 4, 2014: ENH: SQL Enhancement for the Future #6701 (closed, 6 tasks)

This issue was closed.
