read_fwf does not support dtype argument #7141

Closed
brendon9x opened this issue May 16, 2014 · 13 comments · Fixed by #14768

Labels
Enhancement IO CSV read_csv, to_csv

Comments

@brendon9x

The documentation implies that you can supply a dtype dict to read_fwf, but in reality this option is silently dropped as it looks like it's only supported by the c parser. My specific use case is loading Triple-S files which are fairly prolific in the market research world. The Triple-S standard is basically an XML file which describes a fixed width file. It's fairly trivial to get this to work in Pandas in a few lines of code which is great.

The problem arises when these files become really large. I tried using chunked conversion to HDF5 using append_to_multiple but ran into a baffling problem of certain chunks failing on append. Stepping through the code, it looked like the underlying block layouts were different per chunk. This in turn was caused by the fact that column inference is applied per chunk and dtypes are ignored. I suspect this is caused by data being missing in some chunks and not in others.

The lowest hanging fruit is to update the docs and I'm happy to do a PR for this. But it would be awesome if the c parser could be tweaked to allow reading fixed width files as this issue would go away and we'd get a huge speed boost. This initially looked hard, but then started to look like adding a simpler state machine might be possible if colspecs could be passed in. I could possibly do this as a PR, but would probably need some PR hand holding. Finally, if you could point out a simple place to apply the dtype argument in the read_fwf parser, I can give it a go as a second best case PR.

@jreback
Contributor

jreback commented May 16, 2014

http://pandas-docs.github.io/pandas-docs-travis/io.html#specifying-column-data-types

specifying a dtype is currently only supported in the c engine
read_fwf uses the python engine

that said adding a dtype arg to the python engine is pretty easy

then read_fwf should follow from there

@jreback
Contributor

jreback commented May 16, 2014

read_fwf should raise if dtype is passed - is that not the case? (on master)

FYI most dtype issues are because you have an int64 dtype in one chunk and NaNs in another, so that chunk comes out as float64
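The int64/float64 split described here can be reproduced with a small sketch. The data, the colspecs and the chunk size are made up for illustration; the point is only that each chunk infers its dtypes independently, so a blank field in one chunk changes that chunk's dtype.

```python
from io import StringIO

import pandas as pd

# Illustrative fixed-width data: the second column is blank in the third row.
data = (
    " 1 10\n"
    " 2 20\n"
    " 3   \n"
    " 4 40\n"
)
colspecs = [(0, 2), (3, 5)]

# Read two rows at a time; dtype inference runs separately for every chunk.
reader = pd.read_fwf(StringIO(data), colspecs=colspecs, header=None, chunksize=2)
chunk_dtypes = [str(chunk[1].dtype) for chunk in reader]

# The chunk without missing data stays integer, the one with a blank
# field is promoted to float so it can hold NaN.
print(chunk_dtypes)
```

Appending such chunks to a store that fixed its layout from the first chunk is the kind of situation where a mismatch like the one reported above would show up.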

@jreback jreback changed the title read_fwf does not support dtype argument read_fwf does not support dtype argument May 16, 2014
@jreback jreback added this to the 0.15.0 milestone May 16, 2014
@jreback
Contributor

jreback commented May 16, 2014

related: #6889

@mcwitt did we add the warning for read_fwf which calls the python parser (when specifying dtype)?

@jreback
Contributor

jreback commented May 16, 2014

In [2]:         data1 = """\
   ...: 201158    360.242940   149.910199   11950.7
   ...: 201159    444.953632   166.985655   11788.4
   ...: 201160    364.136849   183.628767   11806.2
   ...: 201161    413.836124   184.375703   11916.8
   ...: 201162    502.953953   173.237159   12468.3
   ...: """

In [3]:         colspecs = [(0, 4), (4, 8), (8, 20), (21, 33), (34, 43)]

In [5]:         df = pd.read_fwf(StringIO(data1), colspecs=colspecs, header=None)

In [6]: df
Out[6]: 
      0   1           2           3        4
0  2011  58  360.242940  149.910199  11950.7
1  2011  59  444.953632  166.985655  11788.4
2  2011  60  364.136849  183.628767  11806.2
3  2011  61  413.836124  184.375703  11916.8
4  2011  62  502.953953  173.237159  12468.3

In [7]: df.dtypes
Out[7]: 
0      int64
1      int64
2    float64
3    float64
4    float64
dtype: object

In [8]:         df = pd.read_fwf(StringIO(data1), colspecs=colspecs, header=None,dtype={0 : 'float64'})
ValueError: The 'dtype' option is not supported with the 'python-fwf' engine

So this is correctly raising in master (so tested for the option not-allowed in python-fwf).

@brendon9x
Author

Hi Jeff,

Thanks for getting back so quickly and sorry for not doing the same. I should have checked master before mentioning the lack of error message, so thanks retrospectively for having fixed that. I think having looked at the source, it is possible for me to work around the issue using the converters argument.

If you're amenable to a PR though, I'm keen to try with your guidance as I think converters are probably going to be slow (maybe numba though?). There are a few ways I've spotted to add this functionality in ascending order of ambition:

  1. Convert a dtype argument into a converters argument. Looks easy and purely additive, but probably very slow.
  2. Add a new Cython function try_convert_using_dtype based on maybe_convert_numeric but simplified to throw on error (like np.int64 casting a nan).
  3. Possibly just use standard numpy.astype instead in _convert_types. (guessing not because of all the options).
  4. Look for ways to adapt the c parser. I'm excited by the potential speed of fwf files, but the state machine in there is maybe not needed at all. Perhaps too much of a complexity sacrifice.

Do any of these options sound right? Is there another way? And would you be happy for me to take a stab at it?
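Option 1 in the list above can be sketched roughly as follows. `dtype_to_converters` is a hypothetical helper, not part of pandas, and the sample rows are borrowed from the session earlier in the thread. It maps each requested dtype to the matching NumPy scalar type and hands the result to the existing `converters` argument.

```python
from io import StringIO

import numpy as np
import pandas as pd


def dtype_to_converters(dtype):
    """Hypothetical helper: turn {column: dtype} into {column: callable}."""
    # np.dtype(...).type is the scalar constructor, e.g. np.float64.
    return {col: np.dtype(spec).type for col, spec in dtype.items()}


data1 = """\
201158    360.242940   149.910199   11950.7
201159    444.953632   166.985655   11788.4
"""
colspecs = [(0, 4), (4, 8), (8, 20), (21, 33), (34, 43)]

converters = dtype_to_converters({0: "float64"})
df = pd.read_fwf(
    StringIO(data1), colspecs=colspecs, header=None, converters=converters
)
first_column = df[0].tolist()
```

Because the converter is called once per value, this is likely to be slow on large files, as the list above already suspects, and an integer converter would fail on a blank field.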

@jreback
Contributor

jreback commented May 26, 2014

No problem! So here's the approach that I would take:

Fixed width reading is just a sub-class of PythonParser, and that needs dtype support (the c-parser ALREADY has dtype support directly).

I think it should go just about here somewhere: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1833

Essentially you just try to convert the specified column (using astype in a try: except: block). If it works then it's good, otherwise not.
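A minimal sketch of that idea, outside of pandas itself: `try_cast` is a hypothetical name, and the column is assumed to arrive as an object array of strings, which may not match what the parser holds internally at that point.

```python
import numpy as np


def try_cast(values, dtype):
    """Hypothetical helper: cast one parsed column or raise a clear error."""
    try:
        return values.astype(dtype)
    except (TypeError, ValueError) as err:
        raise ValueError(f"could not convert column to {dtype!r}: {err}") from err


# Assumed shape of a parsed column: strings in an object array.
parsed = np.array(["2011", "2012"], dtype=object)
years = try_cast(parsed, np.int64)

# A value that cannot be parsed makes the cast fail instead of passing silently.
try:
    try_cast(np.array(["2011", "x"], dtype=object), np.int64)
    failed = False
except ValueError:
    failed = True
```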

You can look in the c-parser for reference here: https://github.com/pydata/pandas/blob/master/pandas/parser.pyx#L1216 Though that is a LOT more complicated because it's in c-land and not so friendly.

You also need to do some conversions on the passed dtypes (e.g. you run them through np.dtype in a try-except, as users can pass a myriad of things!). Then you need to convert those a bit (the main one is that all string ones get converted to object, e.g. passing S64 yields an object dtype); you need to do something like this: https://github.com/pydata/pandas/blob/master/pandas/parser.pyx#L998 (note that you really can't 'reuse' this code, but no worries as the python code for this is MUCH simpler).
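The dtype clean-up described here might look something like the sketch below. `normalize_dtype` is a hypothetical name, and treating both the bytes and unicode string kinds as object is an assumption about what "all string ones" should cover.

```python
import numpy as np


def normalize_dtype(spec):
    """Hypothetical helper: validate a user-supplied dtype and map strings to object."""
    try:
        dt = np.dtype(spec)
    except TypeError as err:
        raise TypeError(f"data type {spec!r} not understood") from err
    # Fixed-width string dtypes such as 'S64' are stored as object columns.
    if dt.kind in ("S", "U"):
        return np.dtype(object)
    return dt


from_bytes = normalize_dtype("S64")
from_str = normalize_dtype(str)
from_float = normalize_dtype("float64")
```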

The main thing this needs is a bunch of tests (for possible weird user input), but I think c-parser has a bunch of tests (that are just turned off for Python/FixedWidth parser ATM - that's where the warning/error comes from).

lmk as you progress

@gfyoung
Member

gfyoung commented May 27, 2016

@jreback : Is it just that simple that we do try-except for each specified dtype? Why can it be that simple compared to the C engine? I didn't quite follow your explanation above.

@jreback
Contributor

jreback commented May 27, 2016

In python I think it IS that simple. The c-engine does all of this in cython (basically the same thing); though it does it with much more c-code, which happens to not hold the GIL (though it wasn't that way to start). I don't know why it was not done with try-except; could be efficiency actually; though most of the c-code is doing .astype

@gfyoung
Member

gfyoung commented May 27, 2016

Fair enough. Will give it a shot and see what happens.

@jreback
Contributor

jreback commented May 27, 2016

btw there might be more issues (that are open) w.r.t. dtype handling by python engine. IIRC

@gfyoung
Member

gfyoung commented May 27, 2016

Perhaps. The more I keep looking at this, the more complicated the issue seems to be because of that inconsistent handling of converters, dtypes, and coerced casting. I could put the attempted dtype casting where you had initially put it, but it isn't consistent with what the C engine does because currently the C engine doesn't respect dtype if there are converters involved (similar to #13302). The casting behaviour on both engines is not aligned in fact, and it's surprising to an extent how consistent their behaviour has been in testing.

@gfyoung
Member

gfyoung commented Nov 26, 2016

@chris-b1 : Does #14295 resolve this issue?

@chris-b1
Contributor

It should, although I didn't add any docs or tests for that case - I'll do a follow-up PR, thanks for noticing this.
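For reference, a sketch of the call from the session earlier in the thread, assuming a pandas release that includes the fix linked to this issue. On such a version the dtype is expected to be applied rather than rejected; the two sample rows are taken from that session.

```python
from io import StringIO

import pandas as pd

data1 = """\
201158    360.242940   149.910199   11950.7
201159    444.953632   166.985655   11788.4
"""
colspecs = [(0, 4), (4, 8), (8, 20), (21, 33), (34, 43)]

# Same call as in the session above, on a pandas version with dtype support
# in the python-fwf engine.
df = pd.read_fwf(
    StringIO(data1), colspecs=colspecs, header=None, dtype={0: "float64"}
)
first_dtype = str(df[0].dtype)
```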
