
BUG: fix read_gbq lost precision for longs above 2^53 and floats above 10k

closes #14020
closes #14305

Author: Piotr Chromiec <[email protected]>

Closes #14064 from tworec/read_gbq_full_long_support and squashes the following commits:

788ccee [Piotr Chromiec] BUG: fix read_gbq lost numeric precision
Piotr Chromiec authored and jreback committed Feb 9, 2017
1 parent 3c9fec3 commit c23b1a4
Showing 5 changed files with 263 additions and 128 deletions.
13 changes: 5 additions & 8 deletions doc/source/install.rst
@@ -250,9 +250,9 @@ Optional Dependencies
* `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
* `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

- `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
- `pymysql <https://github.com/PyMySQL/PyMySQL>`__: for MySQL.
- `SQLite <https://docs.python.org/3.5/library/sqlite3.html>`__: for SQLite, this is included in Python's standard library by default.
* `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
* `pymysql <https://github.com/PyMySQL/PyMySQL>`__: for MySQL.
* `SQLite <https://docs.python.org/3.5/library/sqlite3.html>`__: for SQLite, this is included in Python's standard library by default.

* `matplotlib <http://matplotlib.org/>`__: for plotting
* For Excel I/O:
@@ -272,11 +272,8 @@ Optional Dependencies
<http://www.vergenet.net/~conrad/software/xsel/>`__, or `xclip
<https://github.com/astrand/xclip/>`__: necessary to use
:func:`~pandas.read_clipboard`. Most package managers on Linux distributions will have ``xclip`` and/or ``xsel`` immediately available for installation.
* Google's `python-gflags <<https://github.com/google/python-gflags/>`__ ,
`oauth2client <https://github.com/google/oauth2client>`__ ,
`httplib2 <http://pypi.python.org/pypi/httplib2>`__
and `google-api-python-client <http://github.com/google/google-api-python-client>`__
: Needed for :mod:`~pandas.io.gbq`
* For Google BigQuery I/O - see :ref:`here <io.bigquery_deps>`.

* `Backports.lzma <https://pypi.python.org/pypi/backports.lzma/>`__: Only for Python 2, for writing to and/or reading from an xz compressed DataFrame in CSV; Python 3 support is built into the standard library.
* One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.read_html` function:
61 changes: 47 additions & 14 deletions doc/source/io.rst
@@ -39,7 +39,7 @@ object.
* :ref:`read_json<io.json_reader>`
* :ref:`read_msgpack<io.msgpack>`
* :ref:`read_html<io.read_html>`
* :ref:`read_gbq<io.bigquery_reader>`
* :ref:`read_gbq<io.bigquery>`
* :ref:`read_stata<io.stata_reader>`
* :ref:`read_sas<io.sas_reader>`
* :ref:`read_clipboard<io.clipboard>`
@@ -55,7 +55,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
* :ref:`to_json<io.json_writer>`
* :ref:`to_msgpack<io.msgpack>`
* :ref:`to_html<io.html>`
* :ref:`to_gbq<io.bigquery_writer>`
* :ref:`to_gbq<io.bigquery>`
* :ref:`to_stata<io.stata_writer>`
* :ref:`to_clipboard<io.clipboard>`
* :ref:`to_pickle<io.pickle>`
@@ -4648,16 +4648,11 @@ DataFrame with a shape and data types derived from the source table.
Additionally, DataFrames can be inserted into new BigQuery tables or appended
to existing tables.

You will need to install some additional dependencies:

- Google's `python-gflags <https://github.com/google/python-gflags/>`__
- `httplib2 <http://pypi.python.org/pypi/httplib2>`__
- `google-api-python-client <http://github.com/google/google-api-python-client>`__

.. warning::

To use this module, you will need a valid BigQuery account. Refer to the
`BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__ for details on the service itself.
`BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
for details on the service itself.

The key functions are:

@@ -4671,7 +4666,44 @@ The key functions are:

.. currentmodule:: pandas

.. _io.bigquery_reader:

Supported Data Types
++++++++++++++++++++

Pandas supports the following `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
``STRING``, ``INTEGER`` (64 bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
``TIMESTAMP`` (microsecond precision). The ``BYTES`` and ``RECORD`` data types
are not supported.
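
As a rough summary of how these map onto pandas dtypes after this change (a
sketch inferred from the reader code in this commit, not an official table)::

    # BigQuery type -> pandas dtype of the returned column (sketch)
    gbq_to_pandas_dtype = {
        'STRING': object,
        'INTEGER': 'int64',            # object when the column contains NULLs
        'FLOAT': 'float64',
        'BOOLEAN': 'bool',             # object when the column contains NULLs
        'TIMESTAMP': 'datetime64[ns]',
    }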

Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++++++

.. versionadded:: 0.20

Since all columns in BigQuery queries are nullable, and NumPy lacks native ``NA``
support for integer and boolean types, this module stores ``INTEGER`` or
``BOOLEAN`` columns containing at least one ``NULL`` value as ``dtype=object``.
Otherwise those columns are stored as ``dtype=int64`` or ``dtype=bool``
respectively.

This is the opposite of the default pandas behaviour, which promotes an integer
type to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
for a detailed explanation.
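
A minimal sketch of the difference in an interactive session (illustrative
only; these examples are not part of this commit)::

    import pandas as pd

    pd.Series([1, 2, None]).dtype                 # float64 -- default pandas promotion to hold the NA
    pd.Series([1, 2, None], dtype=object).dtype   # object  -- how a nullable INTEGER column is now kept
    pd.Series([1, 2]).dtype                       # int64   -- no NULLs, so the integer dtype survives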

While this trade-off works well in most cases, it breaks down for values
greater than 2**53. Such values in BigQuery can represent identifiers, and
silently losing precision on an identifier is exactly what we want to avoid.
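
The precision issue itself is easy to demonstrate: a 64-bit float has a 53-bit
significand, so round-tripping larger integers through ``float`` silently
corrupts them::

    big_id = 2**53 + 1        # e.g. a 64-bit row identifier
    float(big_id) == 2**53    # True: 9007199254740993 is not representable as float64
    int(float(big_id))        # 9007199254740992 -- off by one, with no warning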

.. _io.bigquery_deps:

Dependencies
++++++++++++

This module requires the following additional dependencies (a quick import check is shown below the list):

- `httplib2 <https://github.com/httplib2/httplib2>`__: HTTP client
- `google-api-python-client <http://github.com/google/google-api-python-client>`__: Google's API client
- `oauth2client <https://github.com/google/oauth2client>`__: authentication and authorization for Google's API
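
If in doubt, importing the packages is a quick way to check that they are
installed (a sketch; pandas itself may import these packages under different
names)::

    import httplib2                                # HTTP client
    import oauth2client                            # OAuth 2.0 helpers for Google APIs
    from googleapiclient.discovery import build    # provided by google-api-python-client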

.. _io.bigquery_authentication:

@@ -4686,7 +4718,7 @@ Is possible to authenticate with either user account credentials or service acco
Authenticating with user account credentials is as simple as following the prompts in a browser window
which will be automatically opened for you. You will be authenticated to the specified
``BigQuery`` account using the product name ``pandas GBQ``. It is only possible on local host.
The remote authentication using user account credentials is not currently supported in Pandas.
The remote authentication using user account credentials is not currently supported in pandas.
Additional information on the authentication mechanism can be found
`here <https://developers.google.com/identity/protocols/OAuth2#clientside/>`__.

@@ -4695,8 +4727,6 @@ is particularly useful when working on remote servers (eg. jupyter iPython noteb
Additional information on service accounts can be found
`here <https://developers.google.com/identity/protocols/OAuth2#serviceaccount>`__.

You will need to install an additional dependency: `oauth2client <https://github.com/google/oauth2client>`__.

Authentication via ``application default credentials`` is also possible. This is only valid
if the parameter ``private_key`` is not provided. This method also requires that
the credentials can be fetched from the environment the code is running in.
@@ -4716,6 +4746,7 @@ Additional information on
A private key can be obtained from the Google developers console by clicking
`here <https://console.developers.google.com/permissions/serviceaccounts>`__. Use JSON key type.
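
A minimal service-account call might then look like the following (a sketch;
the query, project id and key path are placeholders)::

    import pandas as pd

    df = pd.read_gbq('SELECT my_id, my_value FROM my_dataset.my_table',
                     project_id='my-project-id',                   # placeholder project
                     private_key='/path/to/service_account.json')  # JSON key from the console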

.. _io.bigquery_reader:

Querying
''''''''
@@ -4775,7 +4806,6 @@ For more information about query configuration parameters see

.. _io.bigquery_writer:


Writing DataFrames
''''''''''''''''''

@@ -4865,6 +4895,8 @@ For example:
often as the service seems to be changing and evolving. BigQuery is best for analyzing large
sets of data quickly, but it is not a direct replacement for a transactional database.

.. _io.bigquery_create_tables:

Creating BigQuery Tables
''''''''''''''''''''''''

Expand Down Expand Up @@ -4894,6 +4926,7 @@ produce the dictionary representation schema of the specified pandas DataFrame.
the new table with a different name. Refer to
`Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.


.. _io.stata:

Stata Format
5 changes: 4 additions & 1 deletion doc/source/whatsnew/v0.20.0.txt
@@ -369,7 +369,9 @@ Other API Changes
- ``pd.read_csv()`` will now raise a ``ValueError`` for the C engine if the quote character is larger than one byte (:issue:`11592`)
- ``inplace`` arguments now require a boolean value, else a ``ValueError`` is thrown (:issue:`14189`)
- ``pandas.api.types.is_datetime64_ns_dtype`` will now report ``True`` on a tz-aware dtype, similar to ``pandas.api.types.is_datetime64_any_dtype``
- ``DataFrame.asof()`` will return a null filled ``Series`` instead of the scalar ``NaN`` if a match is not found (:issue:`15118`)
- The :func:`pd.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision loss for integers greater than 2**53. Furthermore ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss (:issue:`14064`, :issue:`14305`).

.. _whatsnew_0200.deprecations:

Deprecations
@@ -439,6 +441,7 @@ Bug Fixes

- Bug in ``DataFrame.loc`` with indexing a ``MultiIndex`` with a ``Series`` indexer (:issue:`14730`)


- Bug in ``pd.read_msgpack()`` in which ``Series`` categoricals were being improperly processed (:issue:`14901`)
- Bug in ``Series.ffill()`` with mixed dtypes containing tz-aware datetimes. (:issue:`14956`)

24 changes: 13 additions & 11 deletions pandas/io/gbq.py
@@ -603,18 +603,14 @@ def _parse_data(schema, rows):
# see:
# http://pandas.pydata.org/pandas-docs/dev/missing_data.html
# #missing-data-casting-rules-and-indexing
dtype_map = {'INTEGER': np.dtype(float),
'FLOAT': np.dtype(float),
# This seems to be buggy without nanosecond indicator
dtype_map = {'FLOAT': np.dtype(float),
'TIMESTAMP': 'M8[ns]'}

fields = schema['fields']
col_types = [field['type'] for field in fields]
col_names = [str(field['name']) for field in fields]
col_dtypes = [dtype_map.get(field['type'], object) for field in fields]
page_array = np.zeros((len(rows),),
dtype=lzip(col_names, col_dtypes))

page_array = np.zeros((len(rows),), dtype=lzip(col_names, col_dtypes))
for row_num, raw_row in enumerate(rows):
entries = raw_row.get('f', [])
for col_num, field_type in enumerate(col_types):
@@ -628,7 +624,9 @@ def _parse_data(schema, rows):
def _parse_entry(field_value, field_type):
if field_value is None or field_value == 'null':
return None
if field_type == 'INTEGER' or field_type == 'FLOAT':
if field_type == 'INTEGER':
return int(field_value)
elif field_type == 'FLOAT':
return float(field_value)
elif field_type == 'TIMESTAMP':
timestamp = datetime.utcfromtimestamp(float(field_value))
@@ -757,10 +755,14 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,
'Column order does not match this DataFrame.'
)

# Downcast floats to integers and objects to booleans
# if there are no NaN's. This is presently due to a
# limitation of numpy in handling missing data.
final_df._data = final_df._data.downcast(dtypes='infer')
# cast BOOLEAN and INTEGER columns from object to bool/int
# if they don't have any nulls
type_map = {'BOOLEAN': bool, 'INTEGER': int}
for field in schema['fields']:
if field['type'] in type_map and \
final_df[field['name']].notnull().all():
final_df[field['name']] = \
final_df[field['name']].astype(type_map[field['type']])

connector.print_elapsed_seconds(
'Total time taken',
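
The effect of parsing ``INTEGER`` fields with ``int`` rather than ``float`` is easy to see; the BigQuery API delivers field values as strings, and the old float round trip dropped the low-order bits (illustrative sketch):

    raw = '9007199254740993'      # INTEGER value as delivered by the API (a string)
    int(raw)                      # 9007199254740993 -- exact, the new behaviour
    int(float(raw))               # 9007199254740992 -- the old float round trip loses the last bit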