read_fwf does not support dtype argument #7141
Comments
http://pandas-docs.github.io/pandas-docs-travis/io.html#specifying-column-data-types dtype specification is currently only supported in the C engine. That said, adding a dtype arg to the Python engine is pretty easy, and read_fwf should then follow from there.
read_fwf should raise if dtype is passed. Is that not the case (on master)? FYI, most dtype issues arise because you have an int64 dtype in one chunk and NaNs in another, hence that chunk is float64.
So this is correctly raising in master (there is a test that the option is not allowed in the Python fwf parser).
Hi Jeff, Thanks for getting back so quickly and sorry for not doing the same. I should have checked master before mentioning the lack of error message, so thanks retrospectively for having fixed that. I think having looked at the source, it is possible for me to work around the issue using the If you're amenable to a PR though, I'm keen to try with your guidance as I think converters are probably going to be slow (maybe numba though?). There are a few ways I've spotted to add this functionality in ascending order of ambition:
Do any of these options sound right? Is there another way? And would you be happy for me to take a stab at it?
No problem! So here's the approach that I would take: Fixed width reading is just a sub-class of I think it should go just about here somewhere: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1833 Essentially you just try to convert the specified column (using You can look in the c-parser for reference here: https://github.com/pydata/pandas/blob/master/pandas/parser.pyx#L1216 Though that is a LOT more complicated because it's in C-land and not so friendly. You also need to do some conversions on the passed dtypes (e.g. you run them thru The main thing this needs is a bunch of tests (for possible weird user input), but I think the c-parser has a bunch of tests (that are just turned off for the Python/FixedWidth parser ATM; that's where the warning/error comes from). lmk as you progress.
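To make the suggested approach concrete, here is a minimal sketch in plain Python of "parse the fixed-width fields first, then try to convert each column named in dtype". It is not pandas code: read_fixed_width, its arguments, and the choice to map blank fields to None are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def read_fixed_width(
    lines: Sequence[str],
    colspecs: Sequence[Tuple[int, int]],
    names: Sequence[str],
    dtype: Optional[Dict[str, Callable[[str], object]]] = None,
) -> Dict[str, List[object]]:
    """Hypothetical toy reader: slice each line by colspecs, then cast columns."""
    dtype = dtype or {}
    unknown = set(dtype) - set(names)
    if unknown:
        raise KeyError(f"dtype given for unknown columns: {sorted(unknown)}")

    # Step 1: slice every line into stripped string fields, column by column.
    columns: Dict[str, List[object]] = {name: [] for name in names}
    for line in lines:
        for name, (start, stop) in zip(names, colspecs):
            columns[name].append(line[start:stop].strip())

    # Step 2: convert only the columns the caller asked for, and raise
    # instead of silently falling back to strings when a value does not fit.
    for name, cast in dtype.items():
        try:
            # Assumption: blank fields are treated as missing and become None.
            columns[name] = [cast(v) if v else None for v in columns[name]]
        except ValueError as exc:
            raise ValueError(f"cannot convert column {name!r}") from exc
    return columns


colspecs = [(0, 3), (3, 8)]
names = ["id", "score"]
table = read_fixed_width(
    ["001 10.5", "002     "], colspecs, names, dtype={"id": int, "score": float}
)
```

Casting after the whole column has been collected keeps the per-field slicing untouched, which is why this kind of change can sit in one place in a Python parser. A real implementation would also need to normalize the user-supplied dtypes and handle missing values for integer columns.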
@jreback : Is it just that simple that we do
In python I think it IS that simple. The c-engine does all of this in cython (basically the same thing); though it does it with much more c-code, which happens to not hold the GIL (though it wasn't that way to start). I don't know why it was not done with
Fair enough. Will give it a shot and see what happens.
btw there might be more issues (that are open) w.r.t. dtype handling by python engine. IIRC
Perhaps. The more I keep looking at this, the more complicated the issue seems to be because of that inconsistent handling of converters, dtypes, and coerced casting. I could put the attempted
It should, although I didn't add any docs or tests for that case - I'll do a follow-up PR, thanks for noticing this.
The documentation implies that you can supply a dtype dict to read_fwf, but in reality this option is silently dropped, as it looks like it's only supported by the C parser. My specific use case is loading Triple-S files, which are fairly prolific in the market research world. The Triple-S standard is basically an XML file which describes a fixed width file. It's fairly trivial to get this to work in pandas in a few lines of code, which is great.
The problem arises when these files become really large. I tried using chunked conversion to HDF5 using append_to_multiple but ran into a baffling problem of certain chunks failing on append. Stepping through the code, it looked like the underlying block layouts were different per chunk. This in turn was caused by the fact that column inference is applied per chunk and dtypes are ignored. I suspect this is caused by data being missing in some chunks and not in others.
The lowest hanging fruit is to update the docs, and I'm happy to do a PR for this. But it would be awesome if the C parser could be tweaked to allow reading fixed width files, as this issue would go away and we'd get a huge speed boost. This initially looked hard, but then started to look like adding a simpler state machine might be possible if colspecs could be passed in. I could possibly do this as a PR, but would probably need some PR hand holding. Finally, if you could point out a simple place to apply the dtype argument in the read_fwf parser, I can give it a go as a second best case PR.
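The per-chunk drift described in this report can be reproduced on a small in-memory example. The sketch below assumes a later pandas release in which read_fwf accepts dtype (it did not at the time of this report); the sample data, column names, and the chunk_dtypes helper are made up for illustration.

```python
import io

import pandas as pd

# Two fixed-width columns of width 3; the third row has a blank "score".
data = (
    "  1 10\n"
    "  2 20\n"
    "  3   \n"
    "  4 40\n"
)
colspecs = [(0, 3), (3, 6)]
names = ["id", "score"]


def chunk_dtypes(**kwargs):
    """Hypothetical helper: read in chunks of 2 rows, report each chunk's dtype."""
    reader = pd.read_fwf(
        io.StringIO(data),
        colspecs=colspecs,
        names=names,
        header=None,
        chunksize=2,
        **kwargs,
    )
    return [str(chunk["score"].dtype) for chunk in reader]


# Inference runs per chunk, so the chunk containing the blank field
# can come back with a different dtype than the chunk without one.
inferred = chunk_dtypes()

# Pinning the dtype up front keeps every chunk's layout the same.
pinned = chunk_dtypes(dtype={"score": "float64"})
```

With the dtype pinned, every chunk has the same column layout, which is the property chunked appends to an HDF5 table rely on.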