You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This means integers of a capacity less than 32-bit are written to the database as if they were 64-bit. An example where one might get smaller integers is through the use of pd.to_numeric and downcast=True.
Since there is already a check for the col_type == "integer", I think switching the col.dtype check to this could be a possible fix:
But I'm not sure how to get started with an official PR or if this is a sane thing to do.
For context (this part is not an issue for Pandas, just explaining my interest in this issue), I discovered this when using the df.to_sql method to persist datasets to a Postgres database. From there, I use Postgraphile to auto-generate a GraphQL endpoint. I found that the BigInteger type ends up resolving as a string, because Javascript can't represent such large numbers safely. This would be fine if the source data warranted the high-capacity, but for me it often doesn't.
Expected Output
That the column types that get written to a new table in the database more tightly match their data types in Pandas.
Workarounds
If using df.to_sql, one could set the dtype argument but this can be unruly when there are many columns in the DataFrame.
For now, I have this helper method to downcast integer columns to their lowest capacity, but then back to 32-bit to persist them to the database with the desired column type:
defcast_to_int32(df):
# Downcast to the lowest possible representation.forcolindf.select_dtypes(include=['int']).columns.values:
df[col] =pd.to_numeric(df[col], downcast='integer')
# Upcast back to 32-bit (since that's what gets persisted correctly)forcolindf.select_dtypes(include=['int8', 'int16']).columns.values:
df[col] =df[col].astype('int32')
returndf
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-39-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
wolfc86
changed the title
BUG:
BUG: Improve Mappings from Pandas to SQLAlchemy Types
Jul 1, 2020
wolfc86
changed the title
BUG: Improve Mappings from Pandas to SQLAlchemy Types
BUG: Integer Capacity Higher Than Necessary in Mappings from Pandas to SQLAlchemy Types
Jul 1, 2020
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Problem description
Currently, to map the Pandas data type to a SQLAlchemy type: https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1066-L1069, the code reads:
This means integers of a capacity less than 32-bit are written to the database as if they were 64-bit. An example where one might get smaller integers is through the use of pd.to_numeric and
downcast=True
.Since there is already a check for the
col_type == "integer"
, I think switching thecol.dtype
check to this could be a possible fix:But I'm not sure how to get started with an official PR or if this is a sane thing to do.
For context (this part is not an issue for Pandas, just explaining my interest in this issue), I discovered this when using the df.to_sql method to persist datasets to a Postgres database. From there, I use Postgraphile to auto-generate a GraphQL endpoint. I found that the
BigInteger
type ends up resolving as a string, because Javascript can't represent such large numbers safely. This would be fine if the source data warranted the high-capacity, but for me it often doesn't.Expected Output
That the column types that get written to a new table in the database more tightly match their data types in Pandas.
Workarounds
If using df.to_sql, one could set the
dtype
argument but this can be unruly when there are many columns in the DataFrame.For now, I have this helper method to downcast integer columns to their lowest capacity, but then back to 32-bit to persist them to the database with the desired column type:
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-39-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.17
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None
The text was updated successfully, but these errors were encountered: