index_col in read_csv and read_table ignores dtype argument #9435
Comments
IIRC this is already fixed in master.
The values that are passed to the index constructor are object (e.g. the
My first pandas pull request! I'll give it a shot tomorrow or over the long weekend. Thanks! Notes to self: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L721
This bug still existed in 0.17.
@Eastsun doesn't the open tag on the issue make this clear?
Is there also a quick workaround for the following case? I write multi-indexed columns to csv and then read them as follows:
Omitting the index_col messes up the columns. Rebuilding the correct structure is a bit of a pain...
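For reference, a minimal sketch of the multi-index-columns round trip being described here (the column names are made up for illustration): writing with `to_csv` and reading back with `header=[0, 1]` plus `index_col=0` restores both column levels.

```python
import io
import pandas as pd

# Build a frame with two-level (multi-indexed) columns.
cols = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x')])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)

# Round-trip through CSV: header=[0, 1] rebuilds both column levels,
# and index_col=0 keeps the row labels out of the data columns.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
df2 = pd.read_csv(buf, header=[0, 1], index_col=0)
```

Without `index_col=0` the written row labels come back as an extra data column, which is what shifts the column structure.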
not sure how 'quick' this is....
Nice minimal example here:

```python
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b\n01,2'
>>> read_csv(StringIO(data), index_col='a', dtype={'a': object})
   b
a
1  2
```

The reason for this is that we have this awkward double conversion for indices. First, the column is converted here. After this conversion, the column is output correctly. However, a second conversion then takes place here, and it is this conversion that screws everything up. The reason for this awkward double conversion is that, for one reason or another, some functionality depends on that second conversion. For example, if a converter is passed in for the column, the first conversion step is skipped and the second one is relied on. This is yet another indication of the much-needed refactoring in the
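For contrast, a small sketch showing that when the same column is *not* used as the index, only the first conversion applies and the requested dtype sticks (the leading zero survives):

```python
import io
import pandas as pd

data = 'a,b\n01,2'
# With 'a' left as a regular column, dtype={'a': object} is honored
# and '01' is kept as a string rather than being parsed to the int 1.
df = pd.read_csv(io.StringIO(data), dtype={'a': object})
print(df['a'].tolist())  # ['01']
```

It is only the second, index-specific conversion that discards the requested dtype.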
When sequence IDs are numbers, pandas reads the index as int64 dtype (even if you specify a dtype; index_col=0 overrides the specification and the index is read as int64 no matter what). Open issue: pandas-dev/pandas#9435
Note that this bug also affects
Down to this 2.5-year-old bug, still open at the time of writing: pandas-dev/pandas#9435
Down to this pandas bug, 2.5 years old, still open: pandas-dev/pandas#9435
Just checking, are both of these issues still open on 0.20+? Is makmanalp's resolved but johanneshk's case still open? (The latter is still wrong on 0.20.3.) Do we perhaps need to open a separate issue? Also, in Python 3,
@smcinerney the OP's (@makmanalp) hasn't been addressed as of 0.21.1.
Still open? On 0.23.4:

```python
import io
import pandas as pd

pd.read_csv(io.StringIO('''
version,downloads
1.1,100
1.2,1000
1.3,10000
'''), dtype={'version': str}, index_col='version').index.dtype
# dtype('float64')
```
@gwerbin : Indeed, this is an open issue.
I see. That bites. Is test coverage good enough in this area that this can be fixed without a big refactor?
It should be. The problem is that there are a bunch of failures if you try to "correct" the behavior.
This was caused by pandas loading the index column as int. See https://stackoverflow.com/questions/29792865/how-to-specify-the-dtype-of-index-when-read-a-csv-file-to-dataframe and pandas-dev/pandas#9435 for the workaround used against that. Fixes GH-50.
Ran into this today, seems still open. Any preferred workaround in the meantime?
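One version-independent workaround (a sketch, following the approach from the Stack Overflow link above, not an official fix): read the file without `index_col` so the dtype is honored, then promote the column to the index in a second step.

```python
import io
import pandas as pd

data = 'a,b\n01,2'
# Read without index_col so dtype={'a': object} is respected,
# then set the index afterwards with set_index.
df = pd.read_csv(io.StringIO(data), dtype={'a': object}).set_index('a')
print(df.index.tolist())  # ['01']
```

Since `set_index` does no re-parsing, the index keeps the object dtype that `read_csv` produced for the column.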
xref #11728 for the multi-index case
xref #14379 for converters
Version: