BUG: Skipfooter disables decimal parameter #6971

Closed
GHPS opened this issue Apr 26, 2014 · 12 comments
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@GHPS

GHPS commented Apr 26, 2014

I ran into a bug in the read_csv importer when trying to read a file with
European-style decimal encoding (e.g. 8.1 -> 8,1). Setting the decimal parameter
appropriately should make this easy, but in my case pandas refused to accept any data type other than a plain object.

After a few attempts with various files and snippets of code I narrowed the problem down to the skipfooter parameter. As far as I can judge, skipfooter causes the decimal parameter to be ignored. Take the following example:

In [44]:
import io
import numpy as np
import pandas as pd

data = 'a;b;c\n1,1;2,2;3,3\n4;5;6\n7;8;9'
data

Out[44]:
'a;b;c\n1,1;2,2;3,3\n4;5;6\n7;8;9'

In [45]:
df = pd.read_csv(io.StringIO(data), sep=";",decimal=",",dtype=np.float64)
df

Out[45]:
     a    b    c
0  1.1  2.2  3.3
1  4.0  5.0  6.0
2  7.0  8.0  9.0

3 rows × 3 columns

In [46]:
df.dtypes

Out[46]:
a    float64
b    float64
c    float64
dtype: object

Perfect - the behaviour I expected. Now let’s add a single line as an arbitrary footer and ignore this line in the import.

In [47]:
data = data+'\nFooter'
data

Out[47]:
'a;b;c\n1,1;2,2;3,3\n4;5;6\n7;8;9\nFooter'

In [48]:
df = pd.read_csv(io.StringIO(data), sep=";",decimal=",",dtype=np.float64,skipfooter=1)
df

Out[48]:
     a    b    c
0  1,1  2,2  3,3
1    4    5    6
2    7    8    9

3 rows × 3 columns

In [49]:
df.dtypes

Out[49]:
a    object
b    object
c    object
dtype: object

Now all data type information is lost, presumably because the conversion from comma-separated to dot-separated decimals failed. Adding a converter to the import (converters={'Rate': lambda x: float(x.replace('.','').replace(',','.'))}) works around the problem, which makes it more likely that the skipfooter routine is at fault.
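To make the converter workaround concrete, here is a minimal sketch that uses only the standard library rather than pandas. The helper names parse_european_decimal and read_rows_skipping_footer are hypothetical and chosen for illustration; the sketch assumes a dot is a thousands separator and a comma is the decimal mark, as in the lambda above.

```python
import csv
import io


def parse_european_decimal(text):
    """Turn '1.234,5' (dot = thousands, comma = decimal) into 1234.5."""
    return float(text.replace(".", "").replace(",", "."))


def read_rows_skipping_footer(raw, footer_lines=1):
    """Parse ';'-separated text, dropping the header row and trailing footer lines."""
    rows = list(csv.reader(io.StringIO(raw), delimiter=";"))
    body = rows[1 : len(rows) - footer_lines]  # skip header and footer
    return [[parse_european_decimal(cell) for cell in row] for row in body]


data = "a;b;c\n1,1;2,2;3,3\n4;5;6\n7;8;9\nFooter"
rows = read_rows_skipping_footer(data)
```

Because the footer is cut off before any conversion happens, every remaining cell can be parsed as a float, which is the result one would expect from combining decimal with skipfooter.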

System: iPython 2.0.0, Python 3.3.5, pandas 0.13.0

@jreback
Contributor

jreback commented Apr 26, 2014

cc @mcwitt

can u take a look?

@mcwitt
Contributor

mcwitt commented Apr 26, 2014

Hmm, since #6889 specifying decimal with skip_footer should raise:

In [3]: data = 'a;b;c\n1,1;2,2;3,3\n4;5;6\n7;8;9'

In [4]: pd.read_csv(StringIO(data), sep=';', decimal=',', skip_footer=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
. . .
ValueError: Falling back to the 'python' engine because the 'c' engine does not support skip_footer, but this causes 'decimal' to be ignored as it is not supported by the 'python' engine.

Currently neither of the parser engines can handle this combination, since the C engine can't handle skip_footer and the python engine can't handle decimal.

@jreback maybe I can look at adding support for decimal to PythonParser? I'm not familiar enough with the C engine to say if implementing skip_footer there would be easy...

@jreback
Contributor

jreback commented Apr 26, 2014

ok gr8 so will convert this to an issue adding decimal to python parser (unless that issue already exists?)

@jreback jreback added this to the 0.15.0 milestone Apr 27, 2014
@jreback
Contributor

jreback commented Apr 27, 2014

ok, so the error is propagated nicely in 0.14 / leaving as a bug open for 0.15

@mcwitt if you have time would be great

@GHPS
Author

GHPS commented Apr 27, 2014

> Currently neither of the parser engines can handle this combination, since the C engine can't handle
> skip_footer and the python engine can't handle decimal.

To be precise: What is the official version of the parameter? skipfooter (as in 1) or skip_footer (as in 2)

Since no easy solution to the initial problem is at hand, I'd make a suggestion: drop skipfooter in favour of an enhanced version of skiprows. As a novice to pandas I expected the skipfooter functionality to be implemented in skiprows because a) it's the more generic term and b) it already accepts a list of lines. Intuitively I searched for something pythonic like skiprows=-2 instead of skipfooter=2. skiprows=[6,-2] would then skip 6 lines at the top and 2 from the bottom.

1: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
2: http://pandas.pydata.org/pandas-docs/dev/io.html#io-read-csv-table

@mcwitt
Contributor

mcwitt commented Apr 27, 2014

@jreback sure, I will look at this.

> To be precise: What is the official version of the parameter? skipfooter (as in 1) or skip_footer (as in 2)

Looking back through some old issues (e.g. #1948) it looks like the alias skipfooter was added for consistency with skiprows. The docstring was updated to use skipfooter, but io.rst still needs to be updated.

> Drop skipfooter in favour of an enhanced version of skiprows.

Hmm, this sounds like an elegant solution but I don't think it would cover all use cases: with the current convention we'd expect skiprows=[6,-2] to skip the 6th row and the 2nd row from the end (only 2 rows). I suppose we could make skiprows=-n skip the last n rows, but that would use up the skiprows argument so we couldn't easily skip a header as well...
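The ambiguity can be spelled out with a small sketch. Negative values are not part of the existing skiprows convention; both helpers below (resolve_row_indices and last_n_rows) are hypothetical and only illustrate the two readings under discussion, assuming 0-based row indices and a file of 10 rows.

```python
def resolve_row_indices(skiprows, n_rows):
    """Read each entry as a single row index; negatives count from the end."""
    return sorted({i if i >= 0 else n_rows + i for i in skiprows})


def last_n_rows(n, n_rows):
    """Read a bare negative count -n as 'skip the last n rows'."""
    return list(range(n_rows - n, n_rows))


n_rows = 10
as_indices = resolve_row_indices([6, -2], n_rows)  # two individual rows: 6 and 8
as_tail_count = last_n_rows(2, n_rows)             # the final two rows: 8 and 9
```

Under the list reading only two rows are dropped, while the count reading drops the whole tail but leaves no room in the same argument to skip leading rows as well.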

@GHPS
Author

GHPS commented May 7, 2014

> Hmm, this sounds like an elegant solution but I don't think it would cover all use cases: with the current
> convention we'd expect skiprows=[6,-2] to skip the 6th row and the 2nd row from the end (only 2 rows).

Sadly true. Since a construction with ranges gets ugly very quickly (e.g. range(1,7,1), range(-1,-3,-1)), I'm wondering whether a tuple could be a solution: skiprows=(6,-2)
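A rough sketch of how such a tuple might be interpreted, purely as an illustration of the proposal: trim_head_and_tail is a hypothetical helper, and read_csv does not give tuples this meaning.

```python
def trim_head_and_tail(lines, skiprows):
    """Drop `head` leading lines and `abs(tail)` trailing lines.

    `skiprows` is a hypothetical (head, tail) tuple such as (6, -2).
    """
    head, tail = skiprows
    if head < 0 or tail > 0:
        raise ValueError("expected (non-negative head, non-positive tail)")
    end = len(lines) + tail  # tail is <= 0, so this trims from the end
    return lines[head:end]


lines = [f"line {i}" for i in range(12)]
kept = trim_head_and_tail(lines, (6, -2))
```

The existing parameters can already express the same intent by passing skiprows=6 together with skipfooter=2, so the tuple would mainly be a shorthand.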

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@gfyoung
Member

gfyoung commented Jul 26, 2016

@jreback : With the master tracker from @kawochen in place to keep an eye on these compatibility issues between the C and Python engines, this seems like a dupe to me now.

@jorisvandenbossche
Member

Is it possible this was actually fixed in master in the meantime? If I run the example from above, I get the correct float dtype.

@gfyoung
Member

gfyoung commented Jul 26, 2016

Ah, good point! Seems like the C engine got smarter with parsing since then. 😄

I guess we can close this?

@jorisvandenbossche
Member

Any idea if there is already a test for this? Otherwise can close this by adding a test.

@gfyoung
Member

gfyoung commented Jul 26, 2016

There is one now!
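For readers who want to check the behaviour themselves, a regression check along these lines could look like the sketch below. It is not the test that was added to pandas; the function names are hypothetical, and it assumes a pandas version in which the Python engine honours decimal together with skipfooter.

```python
import io

import numpy as np
import pandas as pd

DATA_WITH_FOOTER = "a;b;c\n1,1;2,2;3,3\n4;5;6\n7;8;9\nFooter"


def read_with_footer_skipped(raw):
    """Read ';'-separated text with ',' decimals, dropping the last line."""
    # engine="python" is requested explicitly because skipfooter is a
    # Python-engine feature; this avoids relying on an implicit fallback.
    return pd.read_csv(
        io.StringIO(raw), sep=";", decimal=",", skipfooter=1, engine="python"
    )


def test_skipfooter_respects_decimal():
    df = read_with_footer_skipped(DATA_WITH_FOOTER)
    assert df.shape == (3, 3)
    assert all(dtype == np.float64 for dtype in df.dtypes)


test_skipfooter_respects_decimal()
```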

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0, Next Major Release Jul 26, 2016
5 participants