Strong coupling on read_html and automatic conversion results in lost data for incorrectly inferred types #10684
Comments
you can specify
cc @cpcloud
@jreback where is the dtype option? I didn't see it referenced in the read_html documentation
it's passed through to the TextReader (same as in read_csv); at least it should be, iirc
@jreback I don't see any reference to TextReader in the documentation for read_html either http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html
hmm, thought we were passing that through. in any event this parses AFAICT. what version are you on?
@jreback #pandas 0.16.2 [ Preferred Target Attack Type Housing Space Training Time Movement Speed Attack Speed Barracks Level Required Range 0 16 2015-07-27 00:00:01 1 0.4 tiles ][ Level Damage per Second Hitpoints Training Cost Research Cost Laboratory Level Required Research Time |
Same for me on 0.16/3.4 (it has an encoded character so maybe that's doing something weird). The parser won't infer a column as dates unless ALL values in that column can be parsed as dates. IOW it's pretty strict on the dtype inference.
@jreback I updated pandas again and my output is matching yours. Thanks for the help.
I am using read_html to read an html table which contains timedelta information in various formats, e.g. "20s" or "20 seconds" for 20 seconds, "5 minutes", "12 hours", "4 days". The default parser mishandles these in two ways: 1) it converts the values to datetimes, with the duration added as a timedelta to the start of the current day, and 2) it sets the values given in days to NaT.
Previously, I could set infer_types to None and, wonderfully, no types would be inferred. Now the data is simply lost, and read_html gives me no option to preserve it.
The following approaches would solve the issue for me:
re-enable infer_types=None. I fail to see the reason why this was ever disabled. Coupling of behaviors is rarely desirable in libraries. The only behavior I actually want is to read an html table into a dataframe of strings. From there, I can convert data as desired or not.
implement date_parser as in read_csv. While this would solve the issue for me, it would still leave unavoidable, undesirable behaviors for the other data types, e.g. the parser auto-converting money amounts to float when decimal is required.
I'd love for pandas to be my go-to library for scraping web page tables, but with the current behavior, it's pretty much useless to me.
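For illustration, here is a minimal sketch of the kind of conversion I'd want to do myself once the table is available as plain strings. It uses only the standard library; `parse_duration` and the unit table are hypothetical helpers written for this example, not part of pandas, and they cover only the spellings listed above. It could presumably be applied to a string column with something like `Series.map`.

```python
import re
from datetime import timedelta

# Hypothetical lookup (not a pandas feature): unit spellings seen in the
# table, mapped to the matching datetime.timedelta keyword argument.
_UNITS = {
    "s": "seconds", "second": "seconds", "seconds": "seconds",
    "minute": "minutes", "minutes": "minutes",
    "hour": "hours", "hours": "hours",
    "day": "days", "days": "days",
}

# An integer amount followed by a unit word, with optional whitespace.
_DURATION = re.compile(r"\s*(\d+)\s*([a-z]+)\s*", re.IGNORECASE)


def parse_duration(text):
    """Return a timedelta for strings such as '20s' or '4 days'.

    Returns None when the string is not recognized, so the caller can
    decide what to do with it instead of silently losing the value.
    """
    match = _DURATION.fullmatch(text)
    if match is None:
        return None
    amount, unit = match.groups()
    keyword = _UNITS.get(unit.lower())
    if keyword is None:
        return None
    return timedelta(**{keyword: int(amount)})
```

The point is that unrecognized cells stay under my control (here they come back as None and the original string is still in the dataframe), rather than being coerced to a datetime or NaT during parsing.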
see also:
#7037
#4770
Code example:
pandas 0.16.2
python 3.4.3