BUG: read multi-index column csv with index_col=False borks #6051

jreback · 2014-01-23T20:57:37Z

http://stackoverflow.com/questions/21318865/read-multi-index-on-the-columns-from-csv-file

hayd · 2014-01-23T21:02:25Z

For convenience, here's a test case:

from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd
s1 = '''Male, Male, Male, Female, Female
R, R, L, R, R
.86, .67, .88, .78, .81'''

s2 = '''Male, Male, Male, Female, Female
R, R, L, R, R
.86, .67, .88, .78, .81
.86, .67, .88, .78, .82'''

In [11]: pd.read_csv(StringIO(s1), header=[0, 1])
Out[11]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

In [12]: pd.read_csv(StringIO(s2), header=[0, 1])
Out[12]: 
   (Male, R)  ( Male,  R)  ( Male,  L)  ( Female,  R)  ( Female,  R)
0       0.86         0.67         0.88           0.82           0.82

It seems to skip the first row after the header.

Note: the columns are tupleized because I wanted to see whether this was also a bug in 0.12.

jreback · 2014-01-23T21:04:00Z

@hayd could you try tupleize_cols=True and see if it works?

TomAugspurger · 2014-01-23T21:06:08Z

Mine is skipping all the subsequent rows (s1 and s2 are from Andy's example):

In [5]: pd.read_csv(StringIO(s1), header=[0, 1])
Out[5]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

[0 rows x 5 columns]

In [6]: pd.read_csv(StringIO(s2), header=[0, 1])
Out[6]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

[0 rows x 5 columns]

hayd · 2014-01-23T21:10:12Z

@jreback maybe I was just being thick about tupleized columns (I forgot the repr of a MultiIndex); that part is working fine, and it's off-topic anyway.

There is a change between 0.12 and 0.13. I see what @TomAugspurger sees in 0.13.

jreback · 2014-01-23T21:14:56Z

I think the problem is that the parser is confused by the lack of an index_col; specifying index_col=0 actually works (but kills the first value...)

hayd · 2014-01-23T21:20:54Z

Seems like the row after the header is being used to name the index?

In [11]: pd.read_csv(StringIO(s1), header=[0, 1], index_col=0)
Out[11]: 
Empty DataFrame
Columns: [( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

In [12]: pd.read_csv(StringIO(s2), header=[0, 1], index_col=0)
Out[12]: 
      ( Male,  R)  ( Male,  L)  ( Female,  R)  ( Female,  R)
.86                                                         
0.86         0.67         0.88           0.82           0.82

jreback · 2014-01-23T21:30:33Z

@hayd yes, if it can, but this is where index_col matters; it is a heuristic (and may be wrong in this case)

waitingkuo · 2014-01-24T04:08:11Z

It seems the problem is caused by the duplicated columns ( Female, R). If you change the second row to a, b, c, d, e, the function works normally. Is it a bug? Or should we raise an exception when there are duplicated multi-level columns?

jreback · 2014-01-24T11:54:46Z

@waitingkuo hmm, a duplicated multi-index is technically valid (prob not tested very well though)
I think this may be related to index_col - it basically has to try to guess whether there are names present or not

want to dig in?

waitingkuo · 2014-01-28T05:12:59Z

For duplicated single-level columns, sequence numbers are appended:

In [13]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'))
Out[13]: 
   R  R.1  L  R.2  R.3
0  1    2  3    4    5

[1 rows x 5 columns]

According to this logic, the multi-column one

Male,Male,Male,Female,Female
R,R,L,R,R

should be converted to

Male,Male,Male,Female,Female
R,R.1,L,R,R.1

Does it make sense?
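
The proposed renaming can be sketched in plain Python. This is only an illustration of the idea, not pandas's actual implementation: the function name mangle_dupe_tuples is hypothetical, column labels are assumed to arrive as tuples (one entry per header row), and only the last level gets the numeric suffix. It does not handle the case where a mangled label collides with a label already present in the header.

```python
from collections import Counter


def mangle_dupe_tuples(columns):
    """Append .1, .2, ... to the last level of repeated column tuples.

    Hypothetical helper for illustration only; not part of pandas.
    """
    seen = Counter()
    result = []
    for col in columns:
        count = seen[col]  # how many times this exact tuple appeared before
        seen[col] += 1
        if count:
            # Keep the outer levels, suffix only the innermost label.
            col = col[:-1] + ("{}.{}".format(col[-1], count),)
        result.append(col)
    return result


header = [
    ("Male", "R"), ("Male", "R"), ("Male", "L"),
    ("Female", "R"), ("Female", "R"),
]
mangled = mangle_dupe_tuples(header)
# mangled second level: R, R.1, L, R, R.1
```

Counting is done per full tuple, so ("Male", "R") and ("Female", "R") are tracked separately, which matches the R,R.1,L,R,R.1 row above.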

hayd · 2014-01-28T05:40:39Z

That looks correct. However there is also a flag for this, mangle_dupe_cols:

In [7]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'), mangle_dupe_cols=False)
Out[7]: 
   R  R  L  R  R
0  5  5  3  5  5

[1 rows x 5 columns]

hayd · 2014-01-28T05:41:21Z

Well... er that's a bug!

waitingkuo · 2014-01-28T06:27:58Z

Things also go wrong when we pass header as a list:

In [4]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'), header=[0])
Out[4]:  
Empty DataFrame
Columns: [R, R, L, R, R]
Index: []

[0 rows x 5 columns]

waitingkuo · 2014-01-29T11:45:37Z

I've figured out the problem and fixed it in Python 2. However, I got stuck in Python 3. Can anyone who has experience with Python 3 give me a hand?

My commit
waitingkuo@b969e96

My Travis Failed build
https://travis-ci.org/waitingkuo/pandas/jobs/17788195

jreback · 2014-01-29T11:55:20Z

use lzip instead of zip
it's imported from pandas.compat
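
For context, a minimal sketch of why this matters: in Python 3, zip returns a one-shot iterator, so code that indexes the result or iterates it twice breaks, whereas in Python 2 it returned a list. The lzip below is a stand-in defined here on the assumption that the pandas.compat helper simply wraps zip in list; it is not imported from pandas.

```python
def lzip(*iterables):
    """Stand-in for the compat helper: zip that always returns a list."""
    return list(zip(*iterables))


level0 = ["Male", "Male", "Female"]
level1 = ["R", "L", "R"]

pairs = lzip(level0, level1)
first = pairs[0]   # indexing works because pairs is a list
n = len(pairs)     # so does len(); a bare Python 3 zip object supports neither
```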

waitingkuo · 2014-01-29T15:51:49Z

Thank you for helping :)
I've made the pull request

Licht-T · 2017-11-09T14:22:50Z

@jreback @gfyoung This seems fixed. We need to close this issue.

jreback · 2017-11-09T14:31:39Z

Are there tests covering this case? If not, can you put one up?

Licht-T · 2017-11-09T14:34:22Z

@jreback Okay. I'll check.

Licht-T · 2017-11-10T13:24:23Z

@jreback Seems that #17060 fixed this bug. But there is no test for multi-index columns.

jreback · 2017-11-10T14:30:42Z

cc @gfyoung

waitingkuo mentioned this issue Jan 29, 2014

BUG: parsing multi-column headers in read_csv (GH6051) #6170

Closed

jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014

jreback modified the milestones: 0.14.1, 0.15.0 May 30, 2014

jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jun 30, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 4, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback removed this from the 0.16.0 milestone Mar 6, 2015

mroeschke added the Testing and good first issue labels and removed the Bug label Jul 6, 2018

mroeschke added the Needs Tests label and removed the Testing label Oct 6, 2019

jbrockmendel mentioned this issue Dec 18, 2019

TST: tests for needs-test issues #12857 #12689 #30327

Merged

7 tasks

jreback modified the milestones: Contributions Welcome, 1.0 Dec 20, 2019

jreback closed this as completed in #30327 Dec 24, 2019
