You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
importpandasaspddefconvert_date_fix(date_str):
returnpd.to_datetime(date_str, format='%d0%m%Y')
html_table="""<table class="gridview" style="width:600px;border-collapse:collapse;" cellspacing="0" cellpadding="4" border="0"> <tbody> <tr> <th scope="col" align="left">Portfolio</th> <th scope="col" align="right">Waehrung</th> <th scope="col" align="right">Volumen</th> <th scope="col" align="right">Datum</th> </tr> <tr> <td align="left">Aktie A</td> <td align="right">EUR</td> <td align="right">20.320.945,77</td> <td align="right">05.01.2021</td> </tr> <tr> <td align="left">Aktie B</td> <td align="right">EUR</td> <td align="right">4.133.996,41</td> <td align="right">05.01.2021</td> </tr> <tr> <td align="left">Aktie C</td> <td align="right">EUR</td> <td align="right">3.855.218,70</td> <td align="right">05.01.2021</td> </tr> </tbody></table>"""#ex1: date-string is converted to int64 with . separator (05.01.2021 becomes 5012021)data_frame=pd.read_html(html_table, thousands='.', decimal=',')[0]
print(data_frame)
print(data_frame.dtypes)
#ex2: date-string is converted to object but string is altered same as above (05.01.2021 becomes 5012021)data_frame=pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : str})[0]
print(data_frame)
print(data_frame.dtypes)
#ex3: ugly fix that reverts conversion and acutally turns the column into a datedata_frame=pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : convert_date_fix})[0]
print(data_frame)
print(data_frame.dtypes)
Problem description
I've set thousands and decimal separator parameters to fit the general number format that is used in Germany and also in my html-table example.
ex1: although the date column gets wrongfully interpreted as integer value, the behaviour is reasonable since no specific converters were defined for that column.
ex2 (the actual problem/bug): when setting a specific converter for the date column and thereby define it as string I'd expect the parser to keep the value untouched and return '05.01.2021', however it also interprets the string as integer and applies the corresponding conversion as above.
ex3: even when setting an own function as converter the original value is still altered and the integer conversion needs to be reverted in my helper function.
Expected Output
When a converter is defined for a specific column, the original value should be passed to the conversion function and conversions that come from thousands-separator for numeric values should not be applied.
Additional improvement/feature request
Adapt read_html to accept a date_parser as in read_csv which would make it easier to automatically parse dates in various formats. The read_html function currently has a boolean parse_dates param (same as read_csv) but in contrast does not support conversion of unconventional date formats.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
I've set
thousands
anddecimal
separator parameters to fit the general number format that is used in Germany and also in my html-table example.ex1: although the date column gets wrongfully interpreted as integer value, the behaviour is reasonable since no specific
converters
were defined for that column.ex2 (the actual problem/bug): when setting a specific
converter
for the date column and thereby define it as string I'd expect the parser to keep the value untouched and return '05.01.2021', however it also interprets the string as integer and applies the corresponding conversion as above.ex3: even when setting an own function as
converter
the original value is still altered and the integer conversion needs to be reverted in my helper function.Expected Output
When a
converter
is defined for a specific column, the original value should be passed to the conversion function and conversions that come from thousands-separator for numeric values should not be applied.Additional improvement/feature request
Adapt
read_html
to accept adate_parser
as inread_csv
which would make it easier to automatically parse dates in various formats. Theread_html
function currently has a booleanparse_dates
param (same asread_csv
) but in contrast does not support conversion of unconventional date formats.see also #10684 or this stackoverflow post.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : German_Austria.1252
pandas : 1.2.0
numpy : 1.20.0rc2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
The text was updated successfully, but these errors were encountered: