index_col in read_csv and read_table ignores dtype argument #9435

Closed
makmanalp opened this issue Feb 6, 2015 · 17 comments · Fixed by #44632
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@makmanalp
Contributor

makmanalp commented Feb 6, 2015

xref #11728 for the multi-index case
xref #14379 for converters

import pandas as pd

# On Python 3 (and recent pandas), use `from io import StringIO` instead.
from pandas.compat import StringIO

data = """Internets,Spaceships
01,a
02,b
03,c
04,d
05,e
06,f
"""

# No leading zeroes in the index because the column is interpreted as numeric.
print(pd.read_csv(StringIO(data), index_col="Internets"))

# Expected the leading zeroes to survive in the index, but got 1, 2, 3, 4 instead. Index is int64.
print(pd.read_csv(StringIO(data), index_col="Internets", dtype={"Internets": object}))

# Leading zeroes are kept now; index is object.
print(pd.read_csv(StringIO(data), dtype={"Internets": object}).set_index("Internets"))

Version:

In [1]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: None
numpy: 1.9.1
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 2.1.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.0
pytz: 2011c
bottleneck: None
tables: None
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 2.0.4
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.6
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Feb 6, 2015

IIRC this is already fixed in master.
Can you give it a try?

@makmanalp
Contributor Author

@jreback Still seems to happen in version 0.15.2-163-g671b384 commit 671b384 (from today). :( Sorry!

I'm guessing the reason for this is that the index assignment happens separately from the dtype conversion somehow - I dug around in parsers.py trying to see if I could find the root but no luck.

@jreback
Contributor

jreback commented Feb 11, 2015

The values that are passed to the Index constructor are object (e.g. the 01, 02, etc.), but the Index constructor will coerce these if possible (and here it IS possible), unless a dtype is passed. So the fix is to pass the dtype to the Index constructor whenever one is given for that column in the dtype spec to read_csv. Want to do a pull request?

@jreback jreback added Bug IO CSV read_csv, to_csv labels Feb 11, 2015
@jreback jreback added this to the 0.17.0 milestone Feb 11, 2015
@makmanalp
Contributor Author

My first pandas pull request! I'll give it a shot tomorrow or over the long weekend. Thanks!

Notes to self:

https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L721
https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L126
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L246
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L558

  • What is the difference between dtype and converters?
  • Stick this in cparserwrapper / python parser or higher up?

@Eastsun

Eastsun commented Nov 7, 2015

This bug still exists in 0.17.

@jreback
Contributor

jreback commented Nov 7, 2015

@Eastsun doesn't the open tag on the issue make this clear?

@johanneshk

Is there also a quick workaround for the following case? I write multi-indexed columns to csv and then read them as follows:

import pandas as pd
from io import StringIO

data=""",col11,col21
,col12,col22
uuid,,
0001,1,1
0002,2,2
"""

print(pd.read_csv(StringIO(data), header=[0, 1], index_col=0, dtype={'uuid': object}))

Omitting the index_col messes up the columns. Rebuilding the correct structure is a bit of a pain...

@jreback
Contributor

jreback commented Nov 16, 2015

not sure how 'quick' this is....

In [35]: result = pd.read_csv(StringIO.StringIO(data),dtype={'uuid':object},skiprows=2).set_index('uuid')

In [36]: result.columns = pd.MultiIndex.from_tuples(pd.read_csv(StringIO.StringIO(data),header=[0,1],index_col=0,nrows=0).columns)

In [37]: result
Out[37]: 
     col11 col21
     col12 col22
uuid            
0001     1     1
0002     2     2

@gfyoung
Member

gfyoung commented Aug 26, 2016

Nice minimal example here:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b\n01,2'
>>> read_csv(StringIO(data), index_col='a', dtype={'a': object})
   b
a   
1  2

This happens because we have an awkward double conversion for indices. First, the column will be converted here. After this conversion, the column is correctly outputted. However, a second conversion will take place here, and it is this second conversion that undoes the requested dtype.

The reason for this awkward double conversion is because for some reason or another, some functionality does depend on that second conversion. For example, if a converter is passed in for the column, the first conversion step is skipped, and the second one is relied on.

This is yet another indication of the much needed refactoring in the read_csv world.
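
As a rough illustration of the double conversion described above (not the actual parser code), the sketch below mimics the two steps with public APIs. Both stand-ins are assumptions chosen for illustration: `astype(object)` plays the role of the first, dtype-respecting pass, and `pd.to_numeric` plays the role of the type inference in the second pass.

```python
import pandas as pd

# Raw tokens as the parser would see them for the index column.
raw = pd.Series(["01", "02", "03"])

# Stand-in for the first conversion: honour the requested dtype (object).
first_pass = raw.astype(object)

# Stand-in for the second conversion: numeric inference on values that
# happen to look like numbers, which discards the leading zeroes.
second_pass = pd.to_numeric(first_pass)

print(list(first_pass))   # strings, leading zeroes intact
print(list(second_pass))  # integers, leading zeroes lost
```

The point is only that a second inference step applied to already-converted strings is enough to lose the dtype the user asked for.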

cduvallet added a commit to cduvallet/dbotu3 that referenced this issue Nov 22, 2016
When sequence IDs are numbers, pandas reads index as int64 dtype (even
if you specify dtype - index_col=0 overrides specification and index is read
as int64 no matter what). Open issue: pandas-dev/pandas#9435
@gfairchild

Note that this bug also affects read_fwf.

alubbock added a commit to alubbock/thunor that referenced this issue Aug 25, 2017
Down to this still open at time of writing 2.5 year old bug:
pandas-dev/pandas#9435
alubbock added a commit to alubbock/thunor-web that referenced this issue Aug 25, 2017
Down to this pandas bug, 2.5 years old, still open:
pandas-dev/pandas#9435
@smcinerney

Just checking, are both of these issues still open on 0.20+? Is makmanalp's resolved but johanneshk's case still open (the latter is still wrong on 0.20.3)? Do we perhaps need to open a separate issue?

Also, in Python 3, import StringIO -> import io ... and instantiate io.StringIO

@grayfall

@smcinerney the OP's (@makmanalp) case hasn't been addressed as of 0.21.1.

@gwerbin

gwerbin commented Oct 8, 2018

Still open?

On 0.23.4:

import io
import pandas as pd
pd.read_csv(io.StringIO('''
version,downloads
1.1,100
1.2,1000
1.3,10000
'''), dtype={'version': str}, index_col='version').index.dtype
# dtype('float64')

@gfyoung
Member

gfyoung commented Oct 8, 2018

@gwerbin : Indeed, this is an open issue.

@gwerbin

gwerbin commented Oct 8, 2018

I see.

The reason for this awkward double conversion is because for some reason or another, some functionality does depend on that second conversion. For example, if a converter is passed in for the column, the first conversion step is skipped, and the second one is relied on.

That bites. Is test coverage good enough in this area that this can be fixed without a big refactor?

@gfyoung
Member

gfyoung commented Oct 8, 2018

Is test coverage good enough in this area that this can be fixed without a big refactor?

It should be. The problem is that there are a bunch of failures if you try to "correct" the behavior.

sacdallago pushed a commit to sacdallago/bio_embeddings that referenced this issue Aug 20, 2020
This was caused by pandas loading the index column as int

See https://stackoverflow.com/questions/29792865/how-to-specify-the-dtype-of-index-when-read-a-csv-file-to-dataframe and pandas-dev/pandas#9435 for the workaround used against that

Fixes GH-50
@RaverJay

RaverJay commented Mar 8, 2021

Ran into this today, seems still open.

Any preferred workaround in the meantime?
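
One workaround that appears earlier in this thread is to leave `index_col` out, let `dtype` apply to the column as usual, and call `set_index` afterwards. A minimal sketch, assuming Python 3 and reusing the column names from the original report (the shortened sample data is only for illustration):

```python
import io

import pandas as pd

data = "Internets,Spaceships\n01,a\n02,b\n03,c\n"

# Read the would-be index as an ordinary column so that dtype is honoured,
# then promote it to the index in a separate step.
df = pd.read_csv(io.StringIO(data), dtype={"Internets": str}).set_index("Internets")

print(df.index.tolist())
```

This costs one extra step but does not depend on how `index_col` and `dtype` interact in a given pandas version. For multi-level column headers, the two-step approach jreback posted earlier in the thread is the closest equivalent.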
