gbq.py: silently downcasting INTEGER columns to FLOAT is problematic #14020

Closed
aschmolck opened this issue Aug 17, 2016 · 4 comments
Labels: Dtype Conversions (Unexpected or buggy dtype conversions)
Milestone: 0.20.0

Comments

aschmolck commented Aug 17, 2016

In many cases INTEGER columns in BigQuery are used to store IDs. Such IDs are then likely to be silently corrupted on conversion to a dataframe. There seems to be no good way around this, even if it's clear a priori that the query result will only return non-null INTEGER columns.

I understand the motivation (the ability to represent NULL values easily and efficiently), but this seems both undesirable (at the very least pandas should give a warning for integers that are not exactly convertible to floats) and inconsistent with how SQL queries work in pandas.

I tried to see if it would be possible to consider the 'mode' (NULLABLE or not) of the field when doing the conversion, but that part of the schema information turns out to be of very limited value in bigquery since even trivial projections turn REQUIRED fields into NULLABLE.

I'm not sure what the best solution is, but I think gbq and sql should behave identically with respect to NULLs, and at the very least there should be a warning on information-losing integer->float conversions.

So my preference would be to check the returned JSON query data for nulls and use np.dtype('int64') if there are none in the INTEGER column, and otherwise 'object'. Where that's deemed too inefficient, I think it's better to add a read_csv-style dtype argument to allow users to explicitly request float representation for a particular column (or just cast directly in the pseudo-SQL query itself). I'd rather have my code be slower (and possibly see a runtime warning to that effect) than unwittingly corrupt data (without any warning).
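
To make the failure mode concrete, here is a minimal sketch (not tied to gbq.py itself) of the precision loss in question: int64 values above 2**53 cannot be represented exactly as float64, so the round trip silently changes the ID.

```python
import numpy as np

# 2**53 + 1 is the smallest positive integer that float64 cannot represent exactly.
original_id = np.int64(9007199254740993)   # a plausible ID value
as_float = np.float64(original_id)         # what the INTEGER -> FLOAT downcast does
round_tripped = np.int64(as_float)

print(original_id, round_tripped)          # 9007199254740993 9007199254740992
print(original_id == round_tripped)        # False -- the ID was silently corrupted
```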

@aschmolck aschmolck changed the title silently downcasting INTEGER columns to FLOAT is problematic gbq.py: silently downcasting INTEGER columns to FLOAT is problematic Aug 17, 2016

jorisvandenbossche (Member) commented

cc @parthea

aschmolck commented Aug 18, 2016

I've written some code that implements the last strategy I mentioned (check the result JSON for null values; if there are any nulls in an INTEGER column, make it of type object, else int64). I have also updated the tests, but don't know how to run them (credentials). This works well for my purposes, but it's still not consistent with read_sql, which can produce either object (all NULL), int64 (all ints, no NULLs), or float (some ints and some NULLs).
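
A rough sketch of that strategy, assuming the raw JSON cells arrive as strings or None; the helper name and row layout are illustrative, not the actual gbq.py code:

```python
import numpy as np
import pandas as pd

def integer_column_to_series(values):
    """values: raw cell values of one INTEGER column from the query JSON (str or None)."""
    if any(v is None for v in values):
        # NULLs present: use object dtype so integer precision is never lost
        return pd.Series([None if v is None else int(v) for v in values], dtype=object)
    # no NULLs: a plain int64 column is safe and exact
    return pd.Series([int(v) for v in values], dtype=np.int64)

print(integer_column_to_series(["1", "9007199254740993"]).dtype)  # int64
print(integer_column_to_series(["1", None]).dtype)                # object
```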

Maybe a more general solution would be to give functions like read_sql, run_query etc. a keyword argument that specifies what to do on encountering NULLs in integer columns:

null_int={'float', 'error', 'object', some_integer_value}

Here 'float' means downcast, 'error' means throw an exception, 'object' means use dtype object, and some_integer_value allows the user to specify e.g. sys.maxsize as the sentinel for NULL values.
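
A sketch of how such a keyword could be handled for a single INTEGER column; the function and option names are hypothetical and only mirror the suggestion above:

```python
import numpy as np
import pandas as pd

def coerce_integer_column(values, null_int="float"):
    """values: raw INTEGER cells (str or None); null_int: 'float', 'error', 'object', or an int sentinel."""
    if not any(v is None for v in values):
        return pd.Series([int(v) for v in values], dtype=np.int64)
    if null_int == "error":
        raise ValueError("NULL encountered in INTEGER column")
    if null_int == "object":
        return pd.Series([None if v is None else int(v) for v in values], dtype=object)
    if null_int == "float":  # today's behaviour: lossy downcast, NULL -> NaN
        return pd.Series([np.nan if v is None else float(v) for v in values], dtype=np.float64)
    # otherwise null_int is an integer sentinel, e.g. sys.maxsize
    return pd.Series([null_int if v is None else int(v) for v in values], dtype=np.int64)
```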

Whilst pandas's ability to work with slightly messy data is a godsend for exploratory work, one thing I really miss is more safety knobs for production use, to ensure that this flexibility doesn't result in silent and sometimes difficult-to-spot data corruption.

So in an ideal world it would even be possible to specify this as an option (mode.null_int = ... or similar) affecting all operations that can silently cast integers, but for starters just changing the BigQuery logic to behave like the SQL one and leaving non-null integer columns as dtype(int) would be a very welcome improvement.

I'd be happy to work on one or several PRs, assuming there is interest.

wesm (Member) commented Sep 8, 2016

This is a long-standing issue that we intend to remedy at the DataFrame / Series level with the pandas 2.0 effort. Having high-fidelity data interchange (i.e. not losing metadata, or even data in edge cases) with databases is extremely important.

With the current internals, it's very hard to work around these issues in a way that does not cause problems elsewhere. See:

https://pydata.github.io/pandas-design/internal-architecture.html#physical-storage-decoupling

aschmolck (Author) commented

That's excellent news! I agree that the outlined approach (nullity bitmaps, abstracting the logical type from its representation) is the right solution, and anything else will be somewhat palliative in nature.

Nonetheless, I still think it's worth making some improvements to the 0.x series of pandas, both to bridge the wait (I assume pandas 2 isn't exactly around the corner) and to smooth the transition.

For example, allowing users to specify the dtype for individual result columns in SQL/BigQuery queries would definitely help, and it would remain useful in pandas 2.0 (for one thing, types can't always be correctly inferred).
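
Purely as an illustration of the per-column dtype idea, here is a minimal sketch of applying a user-supplied dtype mapping to a query result after the fact; the helper and column names are placeholders, not an existing API.

```python
import pandas as pd

def apply_column_dtypes(df, dtypes):
    """Cast selected columns of a query result to user-requested dtypes."""
    return df.astype(dtypes)

# Placeholder frame standing in for what read_gbq currently returns (floats for INTEGER columns).
result = pd.DataFrame({"user_id": [1.0, 150.0], "revenue": [3.5, 4.2]})
fixed = apply_column_dtypes(result, {"user_id": "int64"})
print(fixed.dtypes)  # user_id: int64, revenue: float64
```

Note that casting after the fact can only fix the dtype, not values already rounded during the float conversion, which is the argument for letting the reader accept the dtype mapping directly (as read_csv does).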

@jreback jreback added this to the 0.20.0 milestone Feb 9, 2017
@jreback jreback closed this as completed in c23b1a4 Feb 9, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
…e 10k

closes pandas-dev#14020
closes pandas-dev#14305

Author: Piotr Chromiec <[email protected]>

Closes pandas-dev#14064 from tworec/read_gbq_full_long_support and squashes the following commits:

788ccee [Piotr Chromiec] BUG: fix read_gbq lost numeric precision