Added query_config parameter to read_gbq #14742

Closed
wants to merge 36 commits into from
Changes from 29 commits
Commits
55bf05c
Added udf_resource_uri parameter to read_gbq
necnec Nov 25, 2016
dad9288
Change parameter to kwargs
necnec Nov 28, 2016
9a16a8c
Merge branch 'bigquery-udf-resources'
necnec Nov 28, 2016
f9fae0c
Fix formatting
necnec Nov 28, 2016
42dc9e6
Merge remote-tracking branch 'origin/bigquery-udf-resources'
necnec Nov 28, 2016
c66169d
add read_gbq tests: query parameters and cache
necnec Nov 28, 2016
a96811d
add unit tests read_gbq: query parameters, cache
necnec Nov 28, 2016
ad35a43
fix whatsnew text
necnec Nov 28, 2016
ddb4fd1
Merge branch 'bigquery-udf-resources'
necnec Nov 28, 2016
94fa514
test formatting
necnec Nov 29, 2016
d69ed7f
check tests
necnec Nov 29, 2016
834a2ff
Merge branch 'bigquery-udf-resources'
necnec Nov 29, 2016
640be7a
Change whatnew 0.19.0->0.19.2
necnec Nov 30, 2016
b849300
Change whatsnew 0.19.2 -> 0.20.0
necnec Nov 30, 2016
a952710
Move whatsnew BQ Enhancements -> Enhancements
necnec Dec 2, 2016
0b365da
delete newlines
necnec Dec 2, 2016
c199935
Make query configuration more general
necnec Dec 5, 2016
028c8be
Solve formating problems
necnec Dec 5, 2016
ce8ebe4
Merge branch 'bigquery-udf-resources'
necnec Dec 5, 2016
146f0f3
Merge branch 'master' into bigquery-udf-resources
necnec Dec 5, 2016
8fe77b2
Merge branch 'bigquery-udf-resources'
necnec Dec 5, 2016
c21588a
Merge remote-tracking branch 'pandas-dev/master' into bigquery-udf-re…
necnec Dec 12, 2016
395c0e9
fix formatting
necnec Dec 12, 2016
8a38650
Added example configuration & job_configuration refactoring
necnec Dec 12, 2016
929ad1a
formatting: delete whitespace
necnec Dec 13, 2016
86ed96d
Merge branch 'master' into bigquery-udf-resources
necnec Dec 13, 2016
0ac26a2
added pull request number in whitens
necnec Dec 14, 2016
99521aa
Formatting, documentation, new unit test
necnec Dec 21, 2016
df5dec6
configuration->config & formatting
necnec Dec 22, 2016
8720b03
Delete trailing whitespaces
necnec Dec 22, 2016
ec590af
Throw exception if more than 1 job type in config
necnec Dec 29, 2016
2e02d76
Merge remote-tracking branch 'pandas-dev/master' into bigquery-udf-re…
necnec Dec 29, 2016
e2f801f
hotfix
Dec 29, 2016
b97a1be
formatting
necnec Dec 30, 2016
82f4409
Add some documentation & formatting
necnec Jan 2, 2017
3a238a5
config->configuration
necnec Jan 3, 2017
14 changes: 14 additions & 0 deletions doc/source/io.rst
@@ -4562,6 +4562,20 @@ destination DataFrame as well as a preferred column order as follows:
index_col='index_column_name',
col_order=['col1', 'col2', 'col3'], projectid)


Contributor

add starting in 0.20.0 (or you can add a versionadded tag)

You can pass the query config as a parameter, for example to disable the query cache:

.. code-block:: python
Contributor

Say why this is useful as well. If you have a doc link to the options you might want to pass here, please add it.


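   config = {
      'query': {
         "useQueryCache": False
      }
   }
   # project_id is passed positionally before the keyword argument;
   # a positional argument after config=... would be a SyntaxError
   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                            projectid, config=config)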


.. note::

You can find your project id in the `Google developers console <https://console.developers.google.com>`__.
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -51,6 +51,8 @@ Other enhancements

- ``pd.read_excel`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`)
Contributor

can you add a mini section to the docs and put a pointer here (or is the doc-string enough)?

Contributor

add this PR number as the issue number


- ``pd.read_gbq`` method now allows query configuration preferences (:issue:`14742`)

- New ``UnsortedIndexError`` (subclass of ``KeyError``) raised when indexing/slicing into an
unsorted MultiIndex (:issue:`11897`). This allows differentiation between errors due to lack
of sorting or an incorrect key. See :ref:`here <advanced.unsorted>`
50 changes: 38 additions & 12 deletions pandas/io/gbq.py
@@ -375,7 +375,7 @@ def process_insert_errors(self, insert_errors):

raise StreamingInsertError

def run_query(self, query):
def run_query(self, query, **kwargs):
try:
from googleapiclient.errors import HttpError
except:
@@ -385,16 +385,30 @@ def run_query(self, query):
_check_google_client_version()

job_collection = self.service.jobs()
job_data = {
'configuration': {
'query': {
'query': query,
'useLegacySql': self.dialect == 'legacy'
# 'allowLargeResults', 'createDisposition',
# 'preserveNulls', destinationTable, useQueryCache
}

job_config = {
'query': {
'query': query,
'useLegacySql': self.dialect == 'legacy'
# 'allowLargeResults', 'createDisposition',
# 'preserveNulls', destinationTable, useQueryCache
}
}
config = kwargs.get('config')
Contributor

can you add a comment on what you are doing here (and why)

if config is not None:
if 'query' in config:
if 'query' in config['query'] and query is not None:
raise ValueError("Query statement can't be specified "
"inside config while it is specified "
"as parameter")

job_config['query'].update(config['query'])
else:
raise ValueError("Only 'query' job type is supported")

job_data = {
'configuration': job_config
}

self._start_timer()
try:
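
The merge logic in this hunk can be exercised without a BigQuery connection. The following is a minimal sketch, not the code in the PR: `build_job_data` is a hypothetical standalone helper (an assumption made for illustration) that reproduces how the default `job_config` is combined with a user-supplied `config`.

```python
def build_job_data(query, dialect='legacy', config=None):
    """Hypothetical helper mirroring the config merge in run_query."""
    # Defaults derived from the positional query and the dialect.
    job_config = {
        'query': {
            'query': query,
            'useLegacySql': dialect == 'legacy',
        }
    }
    if config is not None:
        # Only the 'query' job type is accepted.
        if 'query' not in config:
            raise ValueError("Only 'query' job type is supported")
        # The SQL text may come from one place only.
        if 'query' in config['query'] and query is not None:
            raise ValueError("Query statement can't be specified "
                             "inside config while it is specified "
                             "as parameter")
        # User-supplied settings are applied last, so they win.
        job_config['query'].update(config['query'])
    return {'configuration': job_config}


job_data = build_job_data('SELECT 1',
                          config={'query': {'useQueryCache': False}})
print(job_data)
```

Because `dict.update` is applied last, keys in `config['query']` such as `useLegacySql` override the defaults derived from `dialect`.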
@@ -622,8 +636,9 @@ def _parse_entry(field_value, field_type):


def read_gbq(query, project_id=None, index_col=None, col_order=None,
reauth=False, verbose=True, private_key=None, dialect='legacy'):
"""Load data from Google BigQuery.
reauth=False, verbose=True, private_key=None, dialect='legacy',
**kwargs):
r"""Load data from Google BigQuery.

THIS IS AN EXPERIMENTAL LIBRARY

@@ -682,6 +697,17 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,

.. versionadded:: 0.19.0

**kwargs : Arbitrary keyword arguments
config (dict): query config parameters for job processing.
For example:

config = {'query': {'useQueryCache': False}}

For more information see `BigQuery SQL Reference
Contributor

yes this is a good reference, add this above where I indicated

<https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query>`
Member

No indentation relative to "For more ..." is needed here (otherwise it may cause errors when building the docs).


Contributor

can you put the mini example here

.. versionadded:: 0.20.0

Returns
-------
df: DataFrame
@@ -698,7 +724,7 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,
connector = GbqConnector(project_id, reauth=reauth, verbose=verbose,
private_key=private_key,
dialect=dialect)
schema, pages = connector.run_query(query)
schema, pages = connector.run_query(query, **kwargs)
dataframe_list = []
while len(pages) > 0:
page = pages.pop()
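
One consequence of forwarding `**kwargs` and reading only `kwargs.get('config')` is that a misspelled keyword is not rejected. The sketch below uses stand-in functions (hypothetical simplifications, not the pandas implementations) to show the forwarding pattern from this hunk.

```python
def run_query(query, **kwargs):
    # Stand-in for GbqConnector.run_query: only 'config' is looked up,
    # any other keyword is ignored.
    return query, kwargs.get('config')


def read_gbq(query, **kwargs):
    # Stand-in for read_gbq: forwards every extra keyword unchanged.
    return run_query(query, **kwargs)


settings = {'query': {'useQueryCache': False}}
with_config = read_gbq('SELECT 1', config=settings)
with_typo = read_gbq('SELECT 1', confg=settings)  # misspelled keyword
print(with_config)
print(with_typo)
```

In the second call the settings never reach the job configuration, which is an argument for an explicit keyword parameter or for validating unknown keys.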
85 changes: 85 additions & 0 deletions pandas/io/tests/test_gbq.py
@@ -711,6 +711,91 @@ def test_invalid_option_for_sql_dialect(self):
gbq.read_gbq(sql_statement, project_id=_get_project_id(),
dialect='standard', private_key=_get_private_key_path())

def test_query_with_parameters(self):
sql_statement = "SELECT @param1 + @param2 as VALID_RESULT"
config = {
'query': {
"useLegacySql": False,
"parameterMode": "named",
"queryParameters": [
{
"name": "param1",
"parameterType": {
"type": "INTEGER"
},
"parameterValue": {
"value": 1
}
},
{
"name": "param2",
"parameterType": {
"type": "INTEGER"
},
"parameterValue": {
"value": 2
}
}
]
}
}
# Test that a query that relies on parameters fails
# when parameters are not supplied via configuration
with tm.assertRaises(ValueError):
Contributor

Why is this necessary? I thought configuration is an optional parameter. When is it needed?

Contributor Author

@jreback Yes, configuration is optional, but this unit test is a special case: it runs a query with parameters, and in that case you must pass the parameter values in the configuration.

I've made 2 unit tests, so if you think this test is too special I can remove it.

Contributor

It's fine to test. It seems, though, that this test implies the configuration is required somehow.

Contributor Author

@jreback so I don't need to change anything here?

Contributor

Could we update the comment to better explain why we expect a failure here?

For example,
# Test that a query that relies on parameters fails when parameters are not supplied via configuration.

gbq.read_gbq(sql_statement, project_id=_get_project_id(),
private_key=_get_private_key_path())

# Test that the query is successful because we have supplied
# the correct query parameters via the 'config' option
df = gbq.read_gbq(sql_statement, project_id=_get_project_id(),
private_key=_get_private_key_path(),
config=config)
tm.assert_frame_equal(df, DataFrame({'VALID_RESULT': [3]}))
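
The parameter entries in this test all share one shape, so they can be generated rather than written out. This is a minimal sketch, assuming only INTEGER parameters are needed; `named_int_params` is a hypothetical helper, not part of pandas or this PR. It rebuilds the `config` dictionary used in `test_query_with_parameters`.

```python
def named_int_params(**values):
    """Hypothetical helper: 'query' config with named INTEGER parameters."""
    return {
        'query': {
            'useLegacySql': False,
            'parameterMode': 'named',
            'queryParameters': [
                {
                    'name': name,
                    'parameterType': {'type': 'INTEGER'},
                    'parameterValue': {'value': value},
                }
                # Sort by name so the list order is deterministic.
                for name, value in sorted(values.items())
            ],
        }
    }


config = named_int_params(param1=1, param2=2)
print(config['query']['queryParameters'])
```

The result could then be passed as `config=config` in the same way the test does; other parameter types would need their own `parameterType` values.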

def test_query_inside_configuration(self):
query_no_use = 'SELECT "PI_WRONG" as VALID_STRING'
query = 'SELECT "PI" as VALID_STRING'
config = {
'query': {
"query": query,
"useQueryCache": False,
}
}
# Test that the query can't be passed both
# inside config and as a parameter
with tm.assertRaises(ValueError):
gbq.read_gbq(query_no_use, project_id=_get_project_id(),
private_key=_get_private_key_path(),
config=config)

df = gbq.read_gbq(None, project_id=_get_project_id(),
private_key=_get_private_key_path(),
config=config)
tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']}))

def test_configuration_without_query(self):
sql_statement = 'SELECT 1'
config = {
'copy': {
"sourceTable": {
"projectId": _get_project_id(),
"datasetId": "publicdata:samples",
"tableId": "wikipedia"
},
"destinationTable": {
"projectId": _get_project_id(),
"datasetId": "publicdata:samples",
"tableId": "wikipedia_copied"
},
}
}
# Test that only 'query' configurations are supported,
# not 'copy', 'load' or 'extract'
with tm.assertRaises(ValueError):
gbq.read_gbq(sql_statement, project_id=_get_project_id(),
private_key=_get_private_key_path(),
config=config)


class TestToGBQIntegration(tm.TestCase):
# Changes to BigQuery table schema may take up to 2 minutes as of May 2015