read_fwf does not support dtype argument #7141

brendon9x · 2014-05-16T10:44:50Z

The documentation implies that you can supply a dtype dict to read_fwf, but in reality this option is silently dropped, as it looks like it's only supported by the c parser. My specific use case is loading Triple-S files, which are fairly common in the market research world. The Triple-S standard is basically an XML file which describes a fixed width file. It's fairly trivial to get this to work in Pandas in a few lines of code, which is great.

The problem arises when these files become really large. I tried using chunked conversion to HDF5 using append_to_multiple but ran into a baffling problem of certain chunks failing on append. Stepping through the code, it looked like the underlying block layouts were different per chunk. This in turn was caused by the fact that column inference is applied per chunk and dtypes are ignored. I suspect this happens because data is missing in some chunks and not in others.

The lowest hanging fruit is to update the docs and I'm happy to do a PR for this. But it would be awesome if the c parser could be tweaked to allow reading fixed width files, as this issue would go away and we'd get a huge speed boost. This initially looked hard, but then started to look like adding a simpler state machine might be possible if colspecs could be passed in. I could possibly do this as a PR, but would probably need some PR hand holding. Finally, if you could point out a simple place to apply the dtype argument in the read_fwf parser, I can give it a go as a second best case PR.


jreback · 2014-05-16T11:06:00Z

http://pandas-docs.github.io/pandas-docs-travis/io.html#specifying-column-data-types

dtype inference is currently only supported in the c engine
read_fwf uses the python engine

that said adding a dtype arg to the python engine is pretty easy

then read_fwf should follow from there

jreback · 2014-05-16T11:08:36Z

read_fwf should raise if dtype is passed - is that not the case? (on master)

FYI most dtype issues are because you have an int64 dtype in one chunk and NaNs in another, so that chunk comes out as float64
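A minimal sketch of that int64 versus float64 mismatch, using two made-up chunks (the data and the `colspecs` are illustrative assumptions, not taken from the reporter's files):

```python
from io import StringIO

import pandas as pd

# Two hypothetical chunks of the same fixed-width layout.
# The second chunk has a blank value in the numeric column.
chunk_a = "a  1\nb  2\n"
chunk_b = "c  3\nd   \n"

colspecs = [(0, 1), (1, 4)]

df_a = pd.read_fwf(StringIO(chunk_a), colspecs=colspecs, header=None)
df_b = pd.read_fwf(StringIO(chunk_b), colspecs=colspecs, header=None)

# Types are inferred per chunk, so the numeric column differs:
# all-integer values infer as int64, while a missing value forces float64.
print(df_a[1].dtype)
print(df_b[1].dtype)
```

Appending both chunks to the same on-disk table can then fail, because the column no longer has a single dtype.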

jreback · 2014-05-16T12:17:35Z

related: #6889

@mcwitt did we add the warning for read_fwf which calls the python parser (when specifying dtype)?

jreback · 2014-05-16T17:16:45Z

In [2]:         data1 = """\
   ...: 201158    360.242940   149.910199   11950.7
   ...: 201159    444.953632   166.985655   11788.4
   ...: 201160    364.136849   183.628767   11806.2
   ...: 201161    413.836124   184.375703   11916.8
   ...: 201162    502.953953   173.237159   12468.3
   ...: """

In [3]:         colspecs = [(0, 4), (4, 8), (8, 20), (21, 33), (34, 43)]

In [5]:         df = pd.read_fwf(StringIO(data1), colspecs=colspecs, header=None)

In [6]: df
Out[6]: 
      0   1           2           3        4
0  2011  58  360.242940  149.910199  11950.7
1  2011  59  444.953632  166.985655  11788.4
2  2011  60  364.136849  183.628767  11806.2
3  2011  61  413.836124  184.375703  11916.8
4  2011  62  502.953953  173.237159  12468.3

In [7]: df.dtypes
Out[7]: 
0      int64
1      int64
2    float64
3    float64
4    float64
dtype: object

In [8]:         df = pd.read_fwf(StringIO(data1), colspecs=colspecs, header=None,dtype={0 : 'float64'})
ValueError: The 'dtype' option is not supported with the 'python-fwf' engine

So this raises correctly on master (i.e. there is a test that the option is not allowed with python-fwf).

brendon9x · 2014-05-26T21:42:04Z

Hi Jeff,

Thanks for getting back so quickly and sorry for not doing the same. I should have checked master before mentioning the lack of error message, so thanks retrospectively for having fixed that. Having looked at the source, I think it is possible for me to work around the issue using the converters argument.

If you're amenable to a PR though, I'm keen to try with your guidance as I think converters are probably going to be slow (maybe numba though?). There are a few ways I've spotted to add this functionality in ascending order of ambition:

1. Convert a dtype argument into a converters argument. Looks easy and purely additive, but probably very slow.
2. Add a new Cython function try_convert_using_dtype based on maybe_convert_numeric but simplified to throw on error (like np.int64 casting a nan).
3. Possibly just use standard numpy.astype instead in _convert_types (guessing not, because of all the options).
4. Look for ways to adapt the c parser. I'm excited by the potential speed of fwf files, but the state machine in there is maybe not needed at all. Perhaps too much of a complexity sacrifice.
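As a rough illustration of the first option, a dtype mapping can be turned into a converters mapping by hand. `dtype_to_converters` is a hypothetical helper written for this sketch, not a pandas function, and it makes no attempt to handle blank fields (the NumPy scalar constructors raise on an empty string):

```python
from io import StringIO

import numpy as np
import pandas as pd


def dtype_to_converters(dtype_map):
    """Hypothetical helper: map {column: dtype} to {column: callable}."""
    converters = {}
    for col, spec in dtype_map.items():
        # np.dtype(...).type is the scalar constructor, e.g. np.float64,
        # which can parse a numeric string one value at a time.
        converters[col] = np.dtype(spec).type
    return converters


data = "201158\n201159\n"
colspecs = [(0, 4), (4, 6)]

df = pd.read_fwf(
    StringIO(data),
    colspecs=colspecs,
    header=None,
    converters=dtype_to_converters({0: "float64"}),
)
print(df.dtypes)
```

Each converter is called once per value in Python, which is why this route is expected to be slow on large files.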

Do any of these options sound right? Is there another way? And would you be happy for me to take a stab at it?

jreback · 2014-05-26T22:10:08Z

No problem! So here's the approach that I would take:

Fixed width reading is just a sub-class of PythonParser, and that needs dtype support (the c-parser ALREADY has dtype support directly).

I think it should go just about here somewhere: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1833

Essentially you just try to convert the specified column (using astype in a try: except: block). If it works then it's good, otherwise not.
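A minimal sketch of that idea, applied to an already-parsed frame rather than inside the parser. `cast_columns` is a hypothetical name, and re-raising with a clearer message is an assumption about what "otherwise not" should mean:

```python
import numpy as np
import pandas as pd


def cast_columns(df, dtype_map):
    """Hypothetical helper (not pandas internals): cast the requested columns."""
    out = df.copy()
    for col, spec in dtype_map.items():
        try:
            out[col] = out[col].astype(spec)
        except (ValueError, TypeError) as err:
            # Assumption: a failed cast is reported rather than silently skipped.
            raise ValueError(f"cannot cast column {col!r} to {spec}: {err}") from err
    return out


parsed = pd.DataFrame({0: [2011, 2011], 1: [1.5, np.nan]})

ok = cast_columns(parsed, {0: "float64"})
print(ok.dtypes)

try:
    cast_columns(parsed, {1: "int64"})  # NaN cannot be stored in int64
except ValueError as err:
    print("failed:", err)
```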

You can look in the c-parser for reference here: https://github.com/pydata/pandas/blob/master/pandas/parser.pyx#L1216 Though that is a LOT more complicated because it's in c-land and not so friendly.

You also need to do some conversions on the passed dtypes (e.g. you run them through np.dtype in a try-except, as users can pass a myriad of things!). Then you need to convert those a bit (the main one is that all string ones get converted to object, e.g. passing S64 yields an object dtype). You need to do something like this: https://github.com/pydata/pandas/blob/master/pandas/parser.pyx#L998 (note that you really can't 'reuse' this code, but no worries as the python code for this is MUCH simpler).
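That normalization step might look roughly like this in Python. `normalize_dtype` is a hypothetical helper for illustration; it is not the code at the linked line:

```python
import numpy as np


def normalize_dtype(spec):
    """Hypothetical helper: validate a user-supplied dtype and normalize it."""
    try:
        dt = np.dtype(spec)
    except TypeError as err:
        raise TypeError(f"data type {spec!r} not understood") from err
    # String dtypes (bytes 'S' and unicode 'U') are held as object columns.
    if dt.kind in ("S", "U"):
        return np.dtype(object)
    return dt


print(normalize_dtype("S64"))      # a string dtype, normalized to object
print(normalize_dtype("float64"))
print(normalize_dtype(int))
```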

The main thing this needs is a bunch of tests (for possible weird user input), but I think c-parser has a bunch of tests (that are just turned off for Python/FixedWidth parser ATM - that's where the warning/error comes from).

lmk as you progress

gfyoung · 2016-05-27T00:19:09Z

@jreback : Is it just that simple that we do try-except for each specified dtype? Why can it be that simple compared to the C engine? I didn't quite follow your explanation above.

jreback · 2016-05-27T00:26:23Z

In python I think it IS that simple. The c-engine does all of this in cython (basically the same thing); though it does it with much more c-code, which happens to not hold the GIL (though it wasn't that way to start). I don't know why it was not done with try-except; could be efficiency actually; though most of the c-code is doing .astype

gfyoung · 2016-05-27T00:28:03Z

Fair enough. Will give it a shot and see what happens.

jreback · 2016-05-27T00:37:37Z

btw there might be more issues (that are open) w.r.t. dtype handling by python engine. IIRC

gfyoung · 2016-05-27T22:52:34Z

Perhaps. The more I keep looking at this, the more complicated the issue seems to be because of that inconsistent handling of converters, dtypes, and coerced casting. I could put the attempted dtype casting where you had initially put it, but it isn't consistent with what the C engine does because currently the C engine doesn't respect dtype if there are converters involved (similar to #13302). The casting behaviour on both engines is not aligned in fact, and it's surprising to an extent how consistent their behaviour has been in testing.

gfyoung · 2016-11-26T22:30:48Z

@chris-b1 : Does #14295 resolve this issue?

chris-b1 · 2016-11-26T23:23:49Z

It should, although I didn't add any docs or tests for that case - I'll do a follow-up PR, thanks for noticing this.
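For reference, a shortened version of the earlier reproduction, assuming a pandas release that includes the python-engine dtype support from #14295 (older releases raise the ValueError shown above):

```python
from io import StringIO

import pandas as pd

data1 = """\
201158    360.242940   149.910199   11950.7
201159    444.953632   166.985655   11788.4
"""

colspecs = [(0, 4), (4, 8), (8, 20), (21, 33), (34, 43)]

# With dtype support in the python engine, column 0 is read as float64
# instead of the inferred int64; the other columns are still inferred.
df = pd.read_fwf(
    StringIO(data1),
    colspecs=colspecs,
    header=None,
    dtype={0: "float64"},
)
print(df.dtypes)
```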

jreback changed the title ~~read_fwf does not support dtype argument~~ read_fwf does not support dtype argument May 16, 2014

jreback added CSV labels May 16, 2014

jreback added this to the 0.15.0 milestone May 16, 2014

jreback mentioned this issue Sep 29, 2014

read_fwf engine issue #8422

Closed

socheon mentioned this issue Sep 29, 2014

read_csv dtype argument not working when there is a footer #5232

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Jul 20, 2015

BUG: Fix typo-related bug to resolve #9266 #10576

Closed

chris-b1 mentioned this issue Nov 30, 2016

DOC/TST: dtype param in read_fwf #14768

Merged

4 tasks

jorisvandenbossche closed this as completed in #14768 Nov 30, 2016

stefdoerr mentioned this issue Jun 13, 2018

VMD-style select doesn't work when resname is all-digits Acellera/htmd#712

Closed
