read_sql_query type detection when missing data #14314
Comments
Your link is broken. Please post a reproducible example with
The link is fixed, and version information is appended to my first post.
However, I would think pandas should be able to handle the missing values, because something like this happens:
I would have to dig further to see why you are getting object dtype when there are missing values.
Better example from #14319 (not datetime, but same general issue of

    from sqlalchemy import create_engine
    import pandas as pd

    engine = create_engine('sqlite://')
    conn = engine.connect()
    conn.execute("create table test (a float)")
    for _ in range(5):
        conn.execute("insert into test values (NULL)")
    df = pd.read_sql_query("select * from test", engine, coerce_float=True)
    print(df.a)
Here is an example with integers being detected as float while querying the Oracle database. Oracle clearly describes the 'THA_ID', 'TYA_ID', and 'DOA_ID' fields as integers:
Those fields don't have a NOT NULL constraint...
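To illustrate why a nullable integer column tends to come back as float, here is a minimal sketch that does not touch a database at all. It assumes a pandas version with the nullable `Int64` extension dtype (0.24 or later, so newer than the 0.18.1 reported in this issue); the values are made up for illustration.

```python
import pandas as pd

# A NULL from the database arrives in Python as None. NumPy's int64 has no
# way to represent a missing value, so pandas stores the column as float64
# and uses NaN for the gap.
with_null = pd.Series([1, None, 3])
print(with_null.dtype)

# With the nullable extension dtype the values stay integers and the gap
# is kept as a missing value.
nullable = with_null.astype("Int64")
print(nullable.dtype)
```

Casting after the query is only one possible workaround; it does not change how the query result is first inferred.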
@stockersky The dtype in pandas does not depend on the column types in the SQL database when using
@stockersky Here is an example where I show that a missing value in a datetime column does not necessarily lead to a change in the dtype of the resulting dataframe: http://nbviewer.jupyter.org/gist/jorisvandenbossche/ef55675d25296d741726a20adf85211f
@jorisvandenbossche Thanks for spending time on this issue. About the missing value in a datetime column, I have run some more tests and found an interesting behaviour:
Example: this succeeds (type is np.datetime64):
This fails (type is object):
It seems that pandas picks a sample of the result (the first three records?) to set the types...
I can confirm this with an example: if there are more than two missing values at the beginning of a dataframe column, pandas' type detection fails. Here is a gist, based on the one proposed by @jorisvandenbossche, showing this behaviour: with only two missing values, type detection works. But with one more missing value, it fails...
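One way to sidestep the inference for date columns is to name them explicitly. The sketch below is an assumption-laden illustration, not the setup from this thread: it uses the standard-library `sqlite3` driver instead of Oracle, a hypothetical table `events` with a single column `d`, and the `parse_dates` argument of `read_sql_query`. The column starts with three NULLs, the case reported above as failing.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (d TIMESTAMP)")
# Three leading NULLs followed by one real timestamp (stored as text).
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [(None,), (None,), (None,), ("2016-09-01 12:00:00",)],
)

# parse_dates asks pandas to convert the listed columns to datetimes
# instead of relying on what it infers from the returned values.
df = pd.read_sql_query("SELECT * FROM events", conn, parse_dates=["d"])
print(df["d"].dtype)
print(df)
```

This requires knowing in advance which columns hold dates, so it is a workaround rather than a fix for the detection itself.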
@stockersky That observation indeed seems correct. Even without the SQL code, you see the same:
Closing in favor of #10285
I have chosen Pandas for an ETL project and encountered data type detection problems.
I already posted on StackOverflow, and the responses pointed out that missing values can cause data type detection errors in Pandas.
Here is the post, containing a code example and the observed behaviour:
http://stackoverflow.com/questions/39298989/python-pandas-dtypes-detection-from-sql
Briefly, it appears that when querying a database, if a row has missing fields, the type of the whole column is affected: dates are not correctly interpreted, or integers turn into floats.
I understand that type detection can be tricky when working with flat CSV files.
However, since Pandas works with a whole database layer (SQLAlchemy, cx_Oracle, DB API), it should have access to the metadata that describes column types when working with a database.
Am I missing something? Is this a bug? Or a feature that is not yet implemented?
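As a rough illustration of the kind of column metadata meant here, the sketch below reads declared column types from SQLite using only the standard library. This is an assumption for illustration: it is not what pandas does internally, it uses SQLite rather than Oracle, and the table `test` with columns `a`, `d`, and `n` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a FLOAT, d TIMESTAMP, n INTEGER)")

# PRAGMA table_info returns one row per column:
# (cid, name, declared_type, notnull, default_value, pk)
declared = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(test)")}
print(declared)
```

The declared types are available even when every row is NULL, which is the information the question suggests could be used for the dtypes.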
INSTALLED VERSIONS
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-238.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
pandas: 0.18.1
nose: None
pip: 8.1.1
setuptools: 21.2.1
Cython: None
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
Sincerely,
Guillaume