BUG: thousands separator in read_html alters data even though converter is set for a specific column #39005

GGegenhuber · 2021-01-06T17:33:11Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

def convert_date_fix(date_str):
  return pd.to_datetime(date_str, format='%d0%m%Y')

html_table = """
<table class="gridview" style="width:600px;border-collapse:collapse;" cellspacing="0" cellpadding="4" border="0">
  <tbody>
    <tr>
      <th scope="col" align="left">Portfolio</th>
      <th scope="col" align="right">Waehrung</th>
      <th scope="col" align="right">Volumen</th>
      <th scope="col" align="right">Datum</th>
    </tr>
    <tr>
        <td align="left">Aktie A</td>
        <td align="right">EUR</td>
        <td align="right">20.320.945,77</td>
        <td align="right">05.01.2021</td>
    </tr>
    <tr>
      <td align="left">Aktie B</td>
      <td align="right">EUR</td>
      <td align="right">4.133.996,41</td>
      <td align="right">05.01.2021</td>
    </tr>
    <tr>
        <td align="left">Aktie C</td>
        <td align="right">EUR</td>
        <td align="right">3.855.218,70</td>
        <td align="right">05.01.2021</td>
      </tr>
  </tbody>
</table>"""

# ex1: the date string is converted to int64 because '.' is treated as the thousands separator (05.01.2021 becomes 5012021)
data_frame = pd.read_html(html_table, thousands='.', decimal=',')[0]
print(data_frame)
print(data_frame.dtypes)

# ex2: the date column has dtype object, but the string is altered in the same way as above (05.01.2021 becomes 5012021)
data_frame = pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : str})[0]
print(data_frame)
print(data_frame.dtypes)

# ex3: ugly fix that reverts the conversion and actually turns the column into a date
data_frame = pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : convert_date_fix})[0]
print(data_frame)
print(data_frame.dtypes)

Problem description

I set the thousands and decimal parameters to match the number format used in Germany, which is also the format used in my HTML table example.

ex1: although the date column is wrongly interpreted as an integer, the behaviour is reasonable because no converter was defined for that column.

ex2 (the actual problem/bug): when I set a converter for the date column and thereby declare it a string, I expect the parser to leave the value untouched and return '05.01.2021'. Instead, it still treats the string as a number and strips the separator as in ex1.

ex3: even when a custom function is passed as the converter, the original value is still altered, so the helper function has to undo the integer conversion.
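
The behaviour in ex2 and ex3 is consistent with the separator being stripped from number-like cells before any converter runs. The following is a minimal, standard-library-only sketch of that ordering. `strip_thousands` and `read_cell` are hypothetical names used for illustration; they are not pandas functions, and the numeric check is a simplification, not pandas' actual logic.

```python
import re


def strip_thousands(cell: str, thousands: str = ".", decimal: str = ",") -> str:
    """Remove the thousands separator from cells that look numeric.

    Assumption for illustration: a cell "looks numeric" when it consists
    only of digits, signs, and the two separator characters.
    """
    pattern = rf"[-+0-9{re.escape(thousands)}{re.escape(decimal)}]+"
    if re.fullmatch(pattern, cell):
        return cell.replace(thousands, "")
    return cell


def read_cell(cell: str, converter=None):
    """Model of the reported ordering: strip first, convert afterwards."""
    cleaned = strip_thousands(cell)
    return converter(cleaned) if converter else cleaned


# A date made only of digits and '.' passes the numeric check,
# so the converter never sees the original text.
print(read_cell("05.01.2021", converter=str))  # 05012021
print(read_cell("20.320.945,77"))              # 20320945,77
print(read_cell("Aktie A"))                    # Aktie A
```

Under this model a converter such as `str` cannot recover the original value, which matches what ex2 shows.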

Expected Output

When a converter is defined for a specific column, the original cell value should be passed to the converter unchanged, and the thousands-separator handling intended for numeric values should not be applied to that column.
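
One possible workaround is to let per-column converters do the locale handling themselves. This sketch assumes that calling read_html with thousands=None leaves the cell text unmodified, which has not been verified against every pandas version. `parse_german_number` and `parse_german_date` are hypothetical helper names, written with the standard library only.

```python
from datetime import date, datetime


def parse_german_number(text: str) -> float:
    """Convert a number such as '20.320.945,77' to a float."""
    # Drop the '.' thousands separators, then use '.' as the decimal mark.
    return float(text.replace(".", "").replace(",", "."))


def parse_german_date(text: str) -> date:
    """Convert a date such as '05.01.2021' (day.month.year) to a date."""
    return datetime.strptime(text, "%d.%m.%Y").date()


print(parse_german_number("20.320.945,77"))  # 20320945.77
print(parse_german_date("05.01.2021"))       # 2021-01-05
```

If the assumption holds, these helpers could be passed as converters={'Volumen': parse_german_number, 'Datum': parse_german_date} together with thousands=None, so that each converter receives the original string.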

Additional improvement/feature request

Adapt read_html to accept a date_parser as read_csv does, which would make it easier to parse dates in various formats automatically. The read_html function currently has a parse_dates parameter (as read_csv does), but unlike read_csv it does not support converting unconventional date formats.

See also #10684 or this Stack Overflow post.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : German_Austria.1252

pandas : 1.2.0
numpy : 1.20.0rc2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


GGegenhuber added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Jan 6, 2021

simonjayhawkins added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) label and removed the Needs Triage label on Jan 26, 2021

simonjayhawkins added this to the Contributions Welcome milestone Jan 26, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
