BUG: read multi-index column csv with index_col=False borks #6051

Closed
jreback opened this issue Jan 23, 2014 · 21 comments · Fixed by #30327
Labels
good first issue IO CSV read_csv, to_csv Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@jreback
Contributor

jreback commented Jan 23, 2014

http://stackoverflow.com/questions/21318865/read-multi-index-on-the-columns-from-csv-file

@hayd
Contributor

hayd commented Jan 23, 2014

For convenience, here's a test case:

import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO
s1 = '''Male, Male, Male, Female, Female
R, R, L, R, R
.86, .67, .88, .78, .81'''

s2 = '''Male, Male, Male, Female, Female
R, R, L, R, R
.86, .67, .88, .78, .81
.86, .67, .88, .78, .82'''

In [11]: pd.read_csv(StringIO(s1), header=[0, 1])
Out[11]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

In [12]: pd.read_csv(StringIO(s2), header=[0, 1])
Out[12]: 
   (Male, R)  ( Male,  R)  ( Male,  L)  ( Female,  R)  ( Female,  R)
0       0.86         0.67         0.88           0.82           0.82

It seems to skip the first row after the header.

Note: the columns are tupleized because I wanted to see whether this was also a bug in 0.12.

@jreback
Contributor Author

jreback commented Jan 23, 2014

@hayd could you try tupleize_cols=True and see if it works?

@TomAugspurger
Contributor

Mine is skipping all the subsequent rows (s1 and s2 are from Andy's example):

In [5]: pd.read_csv(StringIO(s1), header=[0, 1])
Out[5]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

[0 rows x 5 columns]

In [6]: pd.read_csv(StringIO(s2), header=[0, 1])
Out[6]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

[0 rows x 5 columns]

@hayd
Contributor

hayd commented Jan 23, 2014

@jreback maybe I was just being thick about tupleize_cols (I forgot the repr of a MultiIndex); it is working fine. That's off-topic, though.

There is a change between 0.12 and 0.13. I see what @TomAugspurger sees in 0.13.

@jreback
Contributor Author

jreback commented Jan 23, 2014

I think the problem is that it is confused by the lack of an index_col; specifying index_col=0 actually works (but kills the first value...).

@hayd
Contributor

hayd commented Jan 23, 2014

Seems like the row after the header is being used to name the index?

In [11]: pd.read_csv(StringIO(s1), header=[0, 1], index_col=0)
Out[11]: 
Empty DataFrame
Columns: [( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

In [12]: pd.read_csv(StringIO(s2), header=[0, 1], index_col=0)
Out[12]: 
      ( Male,  R)  ( Male,  L)  ( Female,  R)  ( Female,  R)
.86                                                         
0.86         0.67         0.88           0.82           0.82

@jreback
Contributor Author

jreback commented Jan 23, 2014

@hayd yes, if it can. This is where index_col matters: it is a heuristic (and maybe wrong in this case).

@waitingkuo
Contributor

It seems the problem is caused by the duplicated column ( Female, R). If you change the second row to a, b, c, d, e, the function works normally. Is it a bug? Or should we raise an exception when there are duplicated multi-index columns?

@jreback
Contributor Author

jreback commented Jan 24, 2014

@waitingkuo hmm, a duplicated MultiIndex is technically valid (probably not tested very well, though).
I think this may be related to index_col: it basically has to try to guess whether there are names present or not.

want to dig in?

@waitingkuo
Contributor

For duplicated single-level columns, a sequence number is appended:

In [13]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'))
Out[13]: 
   R  R.1  L  R.2  R.3
0  1    2  3    4    5

[1 rows x 5 columns]

According to this logic, the multi-column one

Male,Male,Male,Female,Female
R,R,L,R,R

should be converted to

Male,Male,Male,Female,Female
R,R.1,L,R,R.1

Does it make sense?
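
The numbering proposed above could be sketched in plain Python as follows. This is only an illustration of the idea, not the parser's actual code: `mangle_dupe_columns` is a hypothetical helper, and it assumes that a multi-level column counts as a duplicate only when the whole tuple repeats, with the suffix applied to the last level. It also ignores the corner case where a generated name such as `R.1` already exists in the header.

```python
from collections import defaultdict


def mangle_dupe_columns(columns):
    """Append .1, .2, ... to repeated column labels (hypothetical helper).

    `columns` is a list of strings (single-level header) or a list of
    tuples (multi-level header). For tuples, only the last level is renamed.
    """
    seen_count = defaultdict(int)
    result = []
    for col in columns:
        n = seen_count[col]
        seen_count[col] += 1
        if n == 0:
            # First occurrence keeps its original label.
            result.append(col)
        elif isinstance(col, tuple):
            result.append(col[:-1] + ("{}.{}".format(col[-1], n),))
        else:
            result.append("{}.{}".format(col, n))
    return result


single = mangle_dupe_columns(["R", "R", "L", "R", "R"])
# single == ["R", "R.1", "L", "R.2", "R.3"]

# Build the two-level header by pairing the rows column by column.
level0 = ["Male", "Male", "Male", "Female", "Female"]
level1 = ["R", "R", "L", "R", "R"]
multi = mangle_dupe_columns(list(zip(level0, level1)))
# Last level of multi: R, R.1, L, R, R.1
```

Counting on the full tuple is what makes (Female, R) restart at plain `R` instead of continuing the numbering from the Male columns.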

@hayd
Contributor

hayd commented Jan 28, 2014

That looks correct. However there is also a flag for this, mangle_dupe_cols:

In [7]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'), mangle_dupe_cols=False)
Out[7]: 
   R  R  L  R  R
0  5  5  3  5  5

[1 rows x 5 columns]

@hayd
Contributor

hayd commented Jan 28, 2014

Well... er that's a bug!

@waitingkuo
Contributor

Things also go wrong when we set header to a list:

In [4]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'), header=[0])
Out[4]:  
Empty DataFrame
Columns: [R, R, L, R, R]
Index: []

[0 rows x 5 columns]

@waitingkuo
Contributor

I've figured out the problem and fixed it in Python 2. However, I got stuck in Python 3. Can anyone who has experience with Python 3 give me a hand?

My commit
waitingkuo@b969e96

My failed Travis build
https://travis-ci.org/waitingkuo/pandas/jobs/17788195

@jreback
Contributor Author

jreback commented Jan 29, 2014

use lzip instead of zip
it's imported from pandas.compat
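
For context, a minimal sketch of why this matters on Python 3: there `zip` returns a one-shot iterator rather than a list, so code that indexes the result or walks it twice breaks. The `lzip` below is a local stand-in written for illustration (assumed to behave like the pandas.compat helper, i.e. `list(zip(...))`), and the header rows are made up.

```python
def lzip(*iterables):
    """Local stand-in for the compat helper: zip, materialised as a list."""
    return list(zip(*iterables))


# Two header rows, one list per level (illustrative data).
header_rows = [["Male", "Male", "Female"], ["R", "L", "R"]]

# On Python 3, zip() is lazy and can only be consumed once.
lazy = zip(*header_rows)
first_pass = list(lazy)
second_pass = list(lazy)  # empty: the iterator is already exhausted

# A list can be indexed, measured and iterated repeatedly.
columns = lzip(*header_rows)
first_column = columns[0]
n_columns = len(columns)
```

On Python 2 the plain `zip` call already returned a list, which is why the same code can pass there and fail on Python 3.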

@waitingkuo
Contributor

Thank you for helping :)
I've made the pull request

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback jreback modified the milestones: 0.14.1, 0.15.0 May 30, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jun 30, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 4, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback removed this from the 0.16.0 milestone Mar 6, 2015
@Licht-T
Contributor

Licht-T commented Nov 9, 2017

@jreback @gfyoung This seems fixed. We need to close this issue.

@jreback
Contributor Author

jreback commented Nov 9, 2017

Are there tests covering this case? If not, can you put one up?

@Licht-T
Contributor

Licht-T commented Nov 9, 2017

@jreback Okay. I'll check.

@Licht-T
Contributor

Licht-T commented Nov 10, 2017

@jreback Seems that #17060 fixed this bug. But there is no test for multi-index columns.
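
A regression check for the multi-index column case might look roughly like the sketch below. It reuses the s2 data from the test case earlier in this thread and assumes a pandas version that already includes the fix. The function name is hypothetical, and the sketch deliberately avoids asserting the exact column labels, since how the duplicated ( Female, R) column is renamed may depend on the pandas version.

```python
from io import StringIO

import pandas as pd

# Same data as s2 in the test case earlier in the thread.
S2 = (
    "Male, Male, Male, Female, Female\n"
    "R, R, L, R, R\n"
    ".86, .67, .88, .78, .81\n"
    ".86, .67, .88, .78, .82"
)


def check_multiindex_header_without_index_col():
    """Hypothetical check: both data rows survive a two-row header."""
    df = pd.read_csv(StringIO(S2), header=[0, 1])
    # The two header rows should become a two-level column index.
    assert isinstance(df.columns, pd.MultiIndex)
    assert df.columns.nlevels == 2
    # Neither data row should be swallowed as index names.
    assert df.shape == (2, 5)
    # Positional access sidesteps the duplicated column labels.
    assert abs(df.iloc[1, 4] - 0.82) < 1e-9
    return df


df = check_multiindex_header_without_index_col()
```

Inside the pandas test suite this would more likely compare against an expected DataFrame; the plain assertions here just keep the sketch self-contained.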

@jreback
Contributor Author

jreback commented Nov 10, 2017

cc @gfyoung

@mroeschke mroeschke added Testing pandas testing functions or related to the test suite good first issue and removed Bug labels Jul 6, 2018
@mroeschke mroeschke added Needs Tests Unit test(s) needed to prevent regressions and removed Testing pandas testing functions or related to the test suite labels Oct 6, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.0 Dec 20, 2019
6 participants