gbq.py: silently downcasting INTEGER columns to FLOAT is problematic #14020
Comments
cc @parthea
I've written some code that implements the last strategy I mentioned (check the result JSON for null values; if there are any nulls in an INTEGER column, make it of type object, otherwise int64). I have also updated the tests, but don't know how to run them (credentials). This works well for my purposes, but it's still not consistent with read_sql, which can produce either object (all-null ints), int64 (all ints), or float (some ints and some nulls). Maybe a more general solution would be to give functions like read_sql, run_query, etc. a keyword argument that specifies what to do on encountering NULLs in integer columns: 'float' means downcast, 'error' means throw an exception, 'object' means use dtype object, and some_integer_value lets the user substitute a specific integer for NULLs.
Whilst pandas's ability to work with slightly messy data is a godsend for exploratory work, one thing I really miss is more safety knobs for production use, to ensure that this flexibility doesn't result in silent and sometimes difficult-to-spot data corruption. So in an ideal world it would even be possible to specify this as an option. I'd be happy to work on one or several PRs, assuming there is interest.
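A minimal sketch of what such a keyword argument could look like, assuming a hypothetical `on_null_integer` parameter and helper (illustrative only, not the actual gbq.py API):

```python
import pandas as pd

def coerce_integer_column(values, on_null_integer="float"):
    """Hypothetical helper: turn one INTEGER result column (a list of Python
    ints and/or None parsed from the query JSON) into a Series."""
    if not any(v is None for v in values):
        return pd.Series(values, dtype="int64")    # no NULLs: exact int64
    if on_null_integer == "float":
        # current behaviour: NULL -> NaN, possible precision loss for large ints
        return pd.Series(values, dtype="float64")
    if on_null_integer == "error":
        raise ValueError("NULL encountered in INTEGER column")
    if on_null_integer == "object":
        return pd.Series(values, dtype="object")   # exact ints, slower object column
    if isinstance(on_null_integer, int):
        # substitute a user-chosen sentinel integer for NULLs
        return pd.Series([on_null_integer if v is None else v for v in values],
                         dtype="int64")
    raise ValueError("invalid on_null_integer: %r" % (on_null_integer,))
```

For example, `coerce_integer_column([1, None, 3], "object")` keeps the ints exact, while the default reproduces today's float downcast.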
This is a long-standing issue that we intend to remedy at the DataFrame / Series level with the pandas 2.0 effort. Having high-fidelity data interchange (i.e. not losing metadata, or even data in edge cases) with databases is extremely important. With the current internals, it's very hard to work around these issues in a way that does not cause problems elsewhere. See: https://pydata.github.io/pandas-design/internal-architecture.html#physical-storage-decoupling
That's excellent news! I agree that the outlined approach (nullity bitmaps, abstracting logical type from representation) is the right solution, and anything else will be somewhat palliative in nature. Nonetheless, I still think it's worth making some improvements to the 0.x series of pandas, both to bridge the wait (I assume pandas 2 isn't exactly around the corner) and to smooth the transition. For example, allowing users to specify the dtype for individual result columns in SQL/BigQuery queries would definitely help and would remain useful in pandas 2.0 (for one thing, types can't always be correctly inferred).
…e 10k closes pandas-dev#14020 closes pandas-dev#14305 Author: Piotr Chromiec <[email protected]> Closes pandas-dev#14064 from tworec/read_gbq_full_long_support and squashes the following commits: 788ccee [Piotr Chromiec] BUG: fix read_gbq lost numeric precision
In many cases INTEGER columns in BigQuery are used to store IDs. Such IDs are then likely to be silently corrupted on conversion to a DataFrame. There seems to be no good way around this, even if it's clear a priori that the query result will only return non-null INTEGER columns.
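For concreteness, here is the kind of silent corruption involved, assuming an ID above float64's 53-bit exact-integer range (the value below is just an example):

```python
import numpy as np

big_id = 2**53 + 1                   # 9007199254740993, a plausible 64-bit ID
roundtrip = int(np.float64(big_id))  # what survives a detour through float64
print(big_id, roundtrip)             # 9007199254740993 9007199254740992
print(big_id == roundtrip)           # False: the downcast changed the ID without any warning
```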
I understand the motivation (the ability to represent NULL values easily and efficiently), but this seems both undesirable (at the very least pandas should give a warning for integers that are not exactly convertible to floats) and inconsistent with how SQL queries work in pandas.
I tried to see if it would be possible to consider the 'mode' (NULLABLE or not) of the field when doing the conversion, but that part of the schema information turns out to be of very limited value in BigQuery, since even trivial projections turn REQUIRED fields into NULLABLE.
I'm not sure what the best solution is, but I think gbq and sql should behave identically with respect to NULLs, and that at the very least there should be a warning on information-losing integer->float conversions.
So my preference would be to check the returned JSON query data for nulls and use np.dtype('int64') if there are none in the INTEGER column, and otherwise 'object'. Where that's deemed too inefficient, I think it's better to add a read_csv-style dtype argument to allow users to explicitly request float representation for a particular column (or just cast directly in the pseudo-SQL query itself). I'd rather have my code be slower (and possibly see a runtime warning to that effect) than unwittingly corrupt data (without any warning).
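A minimal sketch of that preference, assuming the raw response uses the BigQuery-style row layout (each row as `{"f": [{"v": ...}]}`, integers serialized as strings and NULLs as None); the function name is illustrative, not the actual gbq.py internals:

```python
import pandas as pd

def integer_column_from_rows(rows, col_index):
    """Build one INTEGER column without ever passing through float64."""
    raw = [row["f"][col_index]["v"] for row in rows]
    if any(v is None for v in raw):
        # NULLs present: keep exact ints in an object column rather than
        # downcasting to float64 and corrupting large values
        return pd.Series([None if v is None else int(v) for v in raw],
                         dtype="object")
    return pd.Series([int(v) for v in raw], dtype="int64")

rows = [{"f": [{"v": str(2**53 + 1)}]}, {"f": [{"v": "7"}]}]
col = integer_column_from_rows(rows, 0)
print(col.dtype, col[0] == 2**53 + 1)  # int64 True
```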