Strong coupling on read_html and automatic conversion results in lost data for incorrectly inferred types #10684

sixtysecond · 2015-07-27T08:20:52Z

I am using read_html to read an HTML table that contains timedelta information in various formats, e.g. "20s" or "20 seconds" for 20 seconds, "5 minutes", "12 hours", "4 days". The default parser mishandles these in two ways: 1) it converts the values to datetimes, with the duration added as a timedelta to the start of the current day, and 2) it sets the values expressed in days to NaT.

Previously, I could set infer_types to None and, wonderfully, no types would be inferred. Now the data is simply lost, and read_html gives me no option to preserve it.
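To make the "no types inferred" behavior concrete, here is a minimal sketch that uses only the Python standard library rather than pandas. The class name `StringTableParser` and the sample markup are hypothetical, and nested tables, `rowspan`, and `colspan` are not handled. It only shows that every cell can be kept as the exact string found in the page.

```python
from html.parser import HTMLParser


class StringTableParser(HTMLParser):
    """Collect every <td>/<th> cell of every table row as plain text, with no type inference."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of strings
        self._row = None   # row currently being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup standing in for the scraped page.
sample = (
    "<table>"
    "<tr><th>Training Time</th><th>Research Time</th></tr>"
    "<tr><td>20s</td><td>4 days</td></tr>"
    "</table>"
)
parser = StringTableParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Because nothing is converted, values such as "20s" and "4 days" survive unchanged and can be converted later, column by column.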

The following approaches would solve the issue for me:

Re-enable infer_types=None. I fail to see why this was ever disabled. Coupling of behaviors is rarely desirable in libraries. The only behavior I actually want is to read an HTML table into a dataframe of strings. From there, I can convert the data as desired, or not.
Implement date_parser as in read_csv. While this would solve the issue for me, it would still leave unavoidable, undesirable behavior for other data types, e.g. the parser auto-converting monetary amounts to float when decimal is required.
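The "read strings first, convert afterwards" workflow can be sketched as follows. This is an illustration using only the standard library, not a pandas API: the helpers `parse_duration` and `parse_money` and the unit table are hypothetical, and they cover only the formats mentioned in this report.

```python
import re
from datetime import timedelta
from decimal import Decimal

# Hypothetical unit table: maps spellings seen in the scraped cells to timedelta keywords.
_UNITS = {
    "s": "seconds", "second": "seconds", "seconds": "seconds",
    "m": "minutes", "minute": "minutes", "minutes": "minutes",
    "h": "hours", "hour": "hours", "hours": "hours",
    "d": "days", "day": "days", "days": "days",
}
_DURATION = re.compile(r"^\s*(\d+)\s*([A-Za-z]+)\s*$")


def parse_duration(text):
    """Return a timedelta for strings like '20s' or '4 days', or None if unrecognized."""
    match = _DURATION.match(text)
    if match is None:
        return None
    amount, unit = match.groups()
    keyword = _UNITS.get(unit.lower())
    if keyword is None:
        return None
    return timedelta(**{keyword: int(amount)})


def parse_money(text):
    """Return a Decimal for strings like '1,500,000' or '$12.50', never going through float."""
    return Decimal(text.replace(",", "").replace("$", "").strip())


raw_column = ["20s", "20 seconds", "5 minutes", "12 hours", "4 days"]
durations = [parse_duration(cell) for cell in raw_column]
print(durations)
```

Returning None for unrecognized cells, instead of guessing, keeps the caller in control of what happens to values that do not match.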

I'd love for pandas to be my go-to library for scraping web page tables, but with the current behavior, it's pretty much useless to me.

pandas 0.16.2

python 3.4.3

import re

import pandas as pd

url = 'http://clashofclans.wikia.com/wiki/Barbarian'

# Select the overview table by matching text that appears in it.
overview_regex = re.compile('Preferred|Radius')
overview_table = pd.read_html(url, match=overview_regex, header=0)
print(overview_table)  # expected Attack Speed = 1s, actual = 2015-07-27 00:00:01

print('----')
# Select the per-level table the same way.
levels_regex = re.compile('Hitpoints|Total')
levels_table = pd.read_html(url, match=levels_regex, header=0)
print(levels_table)  # expected Research Time in hours or days, actual = NaT or 2015-07-27 06:00:00


jreback · 2015-07-27T11:39:41Z

you can specify dtypes = { column : object } to do this.

jreback · 2015-07-27T11:39:53Z

cc @cpcloud

sixtysecond · 2015-07-27T17:03:07Z

@jreback where is the dtype option? I didn't see it referenced in the read_html documentation

jreback · 2015-07-27T17:06:49Z

it's passed through to the TextReader (same as in read_csv); at least it should be, iirc

sixtysecond · 2015-07-27T18:02:05Z

@jreback I don't see any reference to TextReader in the documentation for read_html either

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html

jreback · 2015-07-27T18:50:43Z

hmm, thought we were passing that through.

in any event, this parses correctly AFAICT. what version are you on?

In [13]: overview_table = pd.read_html(url, match=overview_regex, header=0 )

In [14]: overview_table[0].dtypes
Out[14]: 
Preferred Target           object
Attack Type                object
Housing Space               int64
Training Time              object
Movement Speed              int64
Attack Speed               object
Barracks Level Required     int64
Range                      object
dtype: object

In [15]: overview_table[0]       
Out[15]: 
  Preferred Target          Attack Type  Housing Space Training Time  Movement Speed Attack Speed  Barracks Level Required      Range
0             None  Melee (Ground?Only)              1           20s              16           1s                        1  0.4 tiles

sixtysecond · 2015-07-27T19:45:55Z

@jreback #pandas 0.16.2
#python 3.4.3

[  Preferred Target          Attack Type  Housing Space        Training Time
0             None  Melee (Ground Only)              1  2015-07-27 00:00:20

   Movement Speed         Attack Speed  Barracks Level Required      Range
0              16  2015-07-27 00:00:01                        1  0.4 tiles ]

[   Level  Damage per Second  Hitpoints
0      1                  8         45
1      2                 11         54
2      3                 14         65
3      4                 18         78
4      5                 23         95
5      6                 26        110
6      7                 30        125

   Training Cost
0             25
1             40
2             60
3            100
4            150
5            200
6            250

   Research Cost
0            NaN
1          50000
2         150000
3         500000
4        1500000
5        4500000
6        6000000

   Laboratory Level Required        Research Time
0                        NaN                  NaT
1                          1  2015-07-27 06:00:00
2                          3                  NaT
3                          5                  NaT
4                          6                  NaT
5                          7                  NaT
6                          8                  NaT ]

jreback · 2015-07-27T21:02:10Z

Same for me on 0.16/3.4 (it has an encoded character, so maybe that's doing something weird). The parser won't infer a column as dates unless ALL values in that column can be parsed as dates. IOW, it's pretty strict on dtype inference.

In [26]: str(overview_table[0])
Out[26]: '  Preferred Target          Attack Type  Housing Space Training Time  Movement Speed Attack Speed  Barracks Level Required      Range\n0             None  Melee (Ground\xa0Only)              1           20s              16           1s                        1  0.4 tiles'

In [27]: overview_table[0].dtypes
Out[27]: 
Preferred Target           object
Attack Type                object
Housing Space               int64
Training Time              object
Movement Speed              int64
Attack Speed               object
Barracks Level Required     int64
Range                      object
dtype: object

In [32]: sys.version
Out[32]: '3.4.3 |Continuum Analytics, Inc.| (default, Mar  6 2015, 12:07:41) \n[GCC 4.2.1 (Apple Inc. build 5577)]'
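The all-or-nothing rule described in this comment can be illustrated with a short sketch. It is not pandas' actual implementation: the function name `infer_datetime_column` and the single fixed date format are assumptions made for the example.

```python
from datetime import datetime


def infer_datetime_column(values, fmt="%Y-%m-%d"):
    """Convert a column to datetimes only if every value parses; otherwise return it unchanged."""
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except (TypeError, ValueError):
            # A single unparseable value keeps the whole column as it was.
            return list(values)
    return parsed


all_dates = infer_datetime_column(["2015-07-27", "2015-07-28"])
mixed = infer_datetime_column(["2015-07-27", "20s"])
print(all_dates)
print(mixed)
```

Under this rule, a column mixing "20s" with real dates stays as strings, which is the object dtype shown in the output above.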

sixtysecond · 2015-07-28T16:37:16Z

@jreback I updated pandas again and my output is matching yours. Thanks for the help.

jreback added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Jul 27, 2015

sixtysecond closed this as completed Jul 28, 2015

jorisvandenbossche mentioned this issue Dec 5, 2015

read_html doesn't have infer_types parameter #11764

Closed

GGegenhuber mentioned this issue Jan 6, 2021

BUG: thousands separator in read_html alters data even though converter is set for a specific column #39005

Open

3 tasks
