index_col in read_csv and read_table ignores dtype argument #9435
Comments
IIRC this is already fixed in master.
The values that are passed to the index constructor are object (e.g. the
My first pandas pull request! I'll give it a shot tomorrow or over the long weekend. Thanks! Notes to self: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L721
This bug still existed in 0.17.
@Eastsun doesn't the open tag on the issue make this clear?
Is there also a quick workaround for the following case? I write multi-indexed columns to csv and then read them as follows:
Omitting the index_col messes up the columns. Rebuilding the correct structure is a bit of a pain...
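For reference, a minimal sketch of the multi-index-columns round trip being described here (the column names are made up for illustration): writing with `to_csv` and reading back with `header=[0, 1]` plus `index_col=0` restores both column levels.

```python
import io
import pandas as pd

# Build a frame with two-level (multi-indexed) columns.
cols = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x')])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)

# Round-trip through CSV: header=[0, 1] rebuilds both column levels,
# and index_col=0 keeps the row labels out of the data columns.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
df2 = pd.read_csv(buf, header=[0, 1], index_col=0)
```

Without `index_col=0` the written row labels come back as an extra data column, which is what shifts the column structure.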
not sure how 'quick' this is....
Nice minimal example here:

```python
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b\n01,2'
>>> read_csv(StringIO(data), index_col='a', dtype={'a': object})
   b
a
1  2
```

The reason for this is that we have this awkward double conversion for indices. First, the column is converted here. After this conversion, the column is output correctly. However, a second conversion then takes place here, and it is this conversion that screws everything up. The reason for this awkward double conversion is that, for one reason or another, some functionality depends on that second conversion. For example, if a converter is passed in for the column, the first conversion step is skipped and the second one is relied on. This is yet another indication of the much-needed refactoring in the
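For contrast, a small sketch showing that when the same column is *not* used as the index, only the first conversion applies and the requested dtype sticks (the leading zero survives):

```python
import io
import pandas as pd

data = 'a,b\n01,2'
# With 'a' left as a regular column, dtype={'a': object} is honored
# and '01' is kept as a string rather than being parsed to the int 1.
df = pd.read_csv(io.StringIO(data), dtype={'a': object})
print(df['a'].tolist())  # ['01']
```

It is only the second, index-specific conversion that discards the requested dtype.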
When sequence IDs are numbers, pandas reads the index as int64 dtype (even if you specify a dtype; index_col=0 overrides the specification and the index is read as int64 no matter what). Open issue: pandas-dev/pandas#9435
Note that this bug also affects
Down to this 2.5-year-old bug, still open at the time of writing: pandas-dev/pandas#9435
Down to this pandas bug, 2.5 years old, still open: pandas-dev/pandas#9435
Just checking, are both of these issues still open on 0.20+? Is makmanalp's resolved but johanneshk's case still open? (The latter is still wrong on 0.20.3.) Do we perhaps need to open a separate issue? Also, in Python 3,
@smcinerney the OP's (@makmanalp) hasn't been addressed as of 0.21.1.
Still open? On 0.23.4:

```python
import io
import pandas as pd

pd.read_csv(io.StringIO('''
version,downloads
1.1,100
1.2,1000
1.3,10000
'''), dtype={'version': str}, index_col='version').index.dtype
# dtype('float64')
```
@gwerbin : Indeed, this is an open issue.
I see. That bites. Is test coverage good enough in this area that this can be fixed without a big refactor?
It should be. The problem is that there are a bunch of failures if you try to "correct" the behavior.
This was caused by pandas loading the index column as int. See https://stackoverflow.com/questions/29792865/how-to-specify-the-dtype-of-index-when-read-a-csv-file-to-dataframe and pandas-dev/pandas#9435 for the workaround used against that. Fixes GH-50.
Ran into this today, seems still open. Any preferred workaround in the meantime?
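One version-independent workaround (a sketch, following the approach from the Stack Overflow link above, not an official fix): read the file without `index_col` so the dtype is honored, then promote the column to the index in a second step.

```python
import io
import pandas as pd

data = 'a,b\n01,2'
# Read without index_col so dtype={'a': object} is respected,
# then set the index afterwards with set_index.
df = pd.read_csv(io.StringIO(data), dtype={'a': object}).set_index('a')
print(df.index.tolist())  # ['01']
```

Since `set_index` does no re-parsing, the index keeps the object dtype that `read_csv` produced for the column.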
xref #11728 for the multi-index case
xref #14379 for converters
Version: