gbq.py: silently downcasting INTEGER columns to FLOAT is problematic #14020
Comments
cc @parthea
I've written some code that implements the last strategy I mentioned (check the result JSON for null values; if there are any nulls in an INTEGER column, make it of type object, otherwise int64). I have also updated the tests, but don't know how to run them (credentials). This works well for my purposes, but it's still not consistent with read_sql, which can produce either object (all-null ints), int64 (all ints), or float (some ints and some nulls). Maybe a more general solution would be to give functions like read_sql, run_query, etc. a keyword argument that specifies what to do on encountering NULLs in integer columns: 'float' means downcast, 'error' means throw an exception, 'object' means use dtype object, and some_integer_value lets the user substitute a specific integer for NULLs.
Whilst pandas's ability to work with slightly messy data is a godsend for exploratory work, one thing I really miss is more safety knobs for production use, to ensure that this flexibility doesn't result in silent and sometimes difficult-to-spot data corruption. So in an ideal world it would even be possible to specify this as an option. I'd be happy to work on one or several PRs, assuming there is interest.
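A minimal sketch of what such a keyword argument could look like, assuming a hypothetical `on_null_integer` parameter and helper (illustrative only, not the actual gbq.py API):

```python
import pandas as pd

def coerce_integer_column(values, on_null_integer="float"):
    """Hypothetical helper: turn one INTEGER result column (a list of Python
    ints and/or None parsed from the query JSON) into a Series."""
    if not any(v is None for v in values):
        return pd.Series(values, dtype="int64")    # no NULLs: exact int64
    if on_null_integer == "float":
        # current behaviour: NULL -> NaN, possible precision loss for large ints
        return pd.Series(values, dtype="float64")
    if on_null_integer == "error":
        raise ValueError("NULL encountered in INTEGER column")
    if on_null_integer == "object":
        return pd.Series(values, dtype="object")   # exact ints, slower object column
    if isinstance(on_null_integer, int):
        # substitute a user-chosen sentinel integer for NULLs
        return pd.Series([on_null_integer if v is None else v for v in values],
                         dtype="int64")
    raise ValueError("invalid on_null_integer: %r" % (on_null_integer,))
```

For example, `coerce_integer_column([1, None, 3], "object")` keeps the ints exact, while the default reproduces today's float downcast.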
This is a long-standing issue that we intend to remedy at the DataFrame / Series level with the pandas 2.0 effort. Having high-fidelity data interchange (i.e. not losing metadata, or even data in edge cases) with databases is extremely important. With the current internals, it's very hard to work around these issues in a way that does not cause problems elsewhere. See: https://pydata.github.io/pandas-design/internal-architecture.html#physical-storage-decoupling
That's excellent news! I agree that the outlined approach (nullity bitmaps, abstracting logical type from representation) is the right solution, and anything else will be somewhat palliative in nature. Nonetheless, I still think it's worth making some improvements to the 0.x series of pandas, both to bridge the wait (I assume pandas 2 isn't exactly around the corner) and to smooth the transition. For example, allowing users to specify the dtype for individual result columns in SQL/BigQuery queries would definitely help and would remain useful in pandas 2.0 (for one thing, types can't always be correctly inferred).
…e 10k closes pandas-dev#14020 closes pandas-dev#14305 Author: Piotr Chromiec <[email protected]> Closes pandas-dev#14064 from tworec/read_gbq_full_long_support and squashes the following commits: 788ccee [Piotr Chromiec] BUG: fix read_gbq lost numeric precision
In many cases INTEGER columns in BigQuery are used to store IDs. Such IDs are then likely to be silently corrupted on conversion to a DataFrame. There seems to be no good way around this, even if it's clear a priori that the query result will only return non-null INTEGER columns.
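For concreteness, here is the kind of silent corruption involved, assuming an ID above float64's 53-bit exact-integer range (the value below is just an example):

```python
import numpy as np

big_id = 2**53 + 1                   # 9007199254740993, a plausible 64-bit ID
roundtrip = int(np.float64(big_id))  # what survives a detour through float64
print(big_id, roundtrip)             # 9007199254740993 9007199254740992
print(big_id == roundtrip)           # False: the downcast changed the ID without any warning
```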
I understand the motivation (the ability to represent NULL values easily and efficiently), but this seems both undesirable (at the very least pandas should give a warning for integers that are not exactly convertible to floats) and inconsistent with how SQL queries work in pandas.
I tried to see if it would be possible to consider the 'mode' (NULLABLE or not) of the field when doing the conversion, but that part of the schema information turns out to be of very limited value in BigQuery, since even trivial projections turn REQUIRED fields into NULLABLE.
I'm not sure what the best solution is, but I think gbq and sql should behave identically with respect to NULLs, and that at the very least there should be a warning on information-losing integer->float conversions.
So my preference would be to check the returned JSON query data for nulls and use np.dtype('int64') if there are none in the INTEGER column, and otherwise 'object'. Where that's deemed too inefficient, I think it's better to add a read_csv-style dtype argument to allow users to explicitly request float representation for a particular column (or just cast directly in the pseudo-SQL query itself). I'd rather have my code be slower (and possibly see a runtime warning to that effect) than unwittingly corrupt data (without any warning).
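A minimal sketch of that preference, assuming the raw response uses the BigQuery-style row layout (each row as `{"f": [{"v": ...}]}`, integers serialized as strings and NULLs as None); the function name is illustrative, not the actual gbq.py internals:

```python
import pandas as pd

def integer_column_from_rows(rows, col_index):
    """Build one INTEGER column without ever passing through float64."""
    raw = [row["f"][col_index]["v"] for row in rows]
    if any(v is None for v in raw):
        # NULLs present: keep exact ints in an object column rather than
        # downcasting to float64 and corrupting large values
        return pd.Series([None if v is None else int(v) for v in raw],
                         dtype="object")
    return pd.Series([int(v) for v in raw], dtype="int64")

rows = [{"f": [{"v": str(2**53 + 1)}]}, {"f": [{"v": "7"}]}]
col = integer_column_from_rows(rows, 0)
print(col.dtype, col[0] == 2**53 + 1)  # int64 True
```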