Added query_config parameter to read_gbq #14742

Closed
wants to merge 36 commits
Changes from 9 commits

Commits (36)
55bf05c
Added udf_resource_uri parameter to read_gbq
necnec Nov 25, 2016
dad9288
Change parameter to kwargs
necnec Nov 28, 2016
9a16a8c
Merge branch 'bigquery-udf-resources'
necnec Nov 28, 2016
f9fae0c
Fix formatting
necnec Nov 28, 2016
42dc9e6
Merge remote-tracking branch 'origin/bigquery-udf-resources'
necnec Nov 28, 2016
c66169d
add read_gbq tests: query parameters and cache
necnec Nov 28, 2016
a96811d
add unit tests read_gbq: query parameters, cache
necnec Nov 28, 2016
ad35a43
fix whatsnew text
necnec Nov 28, 2016
ddb4fd1
Merge branch 'bigquery-udf-resources'
necnec Nov 28, 2016
94fa514
test formatting
necnec Nov 29, 2016
d69ed7f
check tests
necnec Nov 29, 2016
834a2ff
Merge branch 'bigquery-udf-resources'
necnec Nov 29, 2016
640be7a
Change whatnew 0.19.0->0.19.2
necnec Nov 30, 2016
b849300
Change whatsnew 0.19.2 -> 0.20.0
necnec Nov 30, 2016
a952710
Move whatsnew BQ Enhancements -> Enhancements
necnec Dec 2, 2016
0b365da
delete newlines
necnec Dec 2, 2016
c199935
Make query configuration more general
necnec Dec 5, 2016
028c8be
Solve formating problems
necnec Dec 5, 2016
ce8ebe4
Merge branch 'bigquery-udf-resources'
necnec Dec 5, 2016
146f0f3
Merge branch 'master' into bigquery-udf-resources
necnec Dec 5, 2016
8fe77b2
Merge branch 'bigquery-udf-resources'
necnec Dec 5, 2016
c21588a
Merge remote-tracking branch 'pandas-dev/master' into bigquery-udf-re…
necnec Dec 12, 2016
395c0e9
fix formatting
necnec Dec 12, 2016
8a38650
Added example configuration & job_configuration refactoring
necnec Dec 12, 2016
929ad1a
formatting: delete whitespace
necnec Dec 13, 2016
86ed96d
Merge branch 'master' into bigquery-udf-resources
necnec Dec 13, 2016
0ac26a2
added pull request number in whitens
necnec Dec 14, 2016
99521aa
Formatting, documentation, new unit test
necnec Dec 21, 2016
df5dec6
configuration->config & formatting
necnec Dec 22, 2016
8720b03
Delete trailing whitespaces
necnec Dec 22, 2016
ec590af
Throw exception if more than 1 job type in config
necnec Dec 29, 2016
2e02d76
Merge remote-tracking branch 'pandas-dev/master' into bigquery-udf-re…
necnec Dec 29, 2016
e2f801f
hotfix
Dec 29, 2016
b97a1be
formatting
necnec Dec 30, 2016
82f4409
Add some documentation & formatting
necnec Jan 2, 2017
3a238a5
config->configuration
necnec Jan 3, 2017
7 changes: 7 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -83,3 +83,10 @@ Performance Improvements

Bug Fixes
~~~~~~~~~

.. _whatsnew_0200.gbq:

Google BigQuery Enhancements
Contributor: doesn't need a sub-section, just add to enhancements

^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- The :func:`read_gbq` method now allows query configuration preferences
Contributor: add a tiny example, add the issue number

Contributor Author: @jreback Sorry, but I have some newbie questions: if there is no issue for this feature, do I need to create one now? And should I add the tiny example to the whatsnew file or to the issue?
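For reference, a minimal sketch of the kind of tiny example being asked for, using the query_config keyword as it exists at this revision of the pull request (later commits rename it to configuration); the project ID below is a placeholder:

from pandas.io import gbq

# Minimal sketch: disable BigQuery's query cache via the query_config
# keyword from this revision ('my-project' is a placeholder project ID).
config = {"useQueryCache": False}

df = gbq.read_gbq('SELECT "PI" as VALID_STRING',
                  project_id='my-project',
                  query_config=config)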

17 changes: 14 additions & 3 deletions pandas/io/gbq.py
@@ -375,7 +375,7 @@ def process_insert_errors(self, insert_errors):

raise StreamingInsertError

def run_query(self, query):
def run_query(self, query, **kwargs):
try:
from googleapiclient.errors import HttpError
except:
@@ -395,6 +395,9 @@ def run_query(self, query):
}
}
}
query_config = kwargs.get('query_config')
if query_config is not None:
job_data['configuration']['query'].update(query_config)

self._start_timer()
try:
@@ -622,7 +625,8 @@ def _parse_entry(field_value, field_type):


def read_gbq(query, project_id=None, index_col=None, col_order=None,
reauth=False, verbose=True, private_key=None, dialect='legacy'):
reauth=False, verbose=True, private_key=None, dialect='legacy',
**kwargs):
"""Load data from Google BigQuery.

THIS IS AN EXPERIMENTAL LIBRARY
@@ -682,6 +686,13 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,

.. versionadded:: 0.19.0

**kwargs: Arbitrary keyword arguments
Contributor: this is just like using a named argument. What I would like to see is something like:

config = {'query' : ....}
read_gbq(....., configuration=config)

then just add these keys directly.
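A minimal sketch of the pass-through being described, assuming the job_data structure shown in the run_query diff above; anything the caller puts under config['query'] is merged into the BigQuery job request unchanged, so new options require no further pandas changes:

# Sketch of the pass-through: keys under config['query'] are merged
# into the BigQuery job request as-is.
job_data = {
    'configuration': {
        'query': {
            'query': 'SELECT 1',
            'useLegacySql': True,
        }
    }
}

config = {'query': {'useQueryCache': False}}  # user-supplied configuration

job_data['configuration']['query'].update(config.get('query', {}))
# job_data now carries useQueryCache without any read_gbq signature change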

Contributor Author (necnec, Nov 30, 2016): @jreback Hmm, in this case I could use read_gbq for loading files, couldn't I? I mean:

config = {'load' : ....}

and then read_gbq could be used not only for reading queries, and the query parameter would seem unnecessary. Could you give me some more motivation for the configuration parameter?
I think BigQuery jobs like load and copy are outside of the pandas philosophy?

Contributor: not sure what you mean. The point here is that if someone wants to specify a configuration option, then they can just pass it thru as a dict structure. I don't want to have to change pandas code again when someone wants 'another' option. These should just pass thru.

Contributor Author (necnec, Dec 5, 2016): @jreback Sorry for the long response, but I haven't fully got your idea yet:

1. The things I worry about are the arguments query and dialect; they become redundant:

config = {
  'query' : {
    'query':  'select 1',
    'useLegacySql': dialect == 'legacy'
  }
}
read_gbq(query, dialect, configuration=config)

So I am going to use the following logic: if query is not specified in the config, I will use the query parameter of read_gbq. But if query or dialect is specified in the config, must I throw an exception?

2. As I understand your idea, the configuration parameter should look like:

config = {
  'query' : {
    "useQueryCache": False,
  }
}

Not like:

config = {
  "useQueryCache": False
}

And in this case I could pass:

config = {'load' : ....}

What should I do with the query parameter of read_gbq? Just skip it inside?
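A sketch of one possible shape of the precedence logic described above (hypothetical helper, assuming the nested configuration layout from point 2):

def _resolve_query(query, config):
    # Hypothetical sketch: fall back to read_gbq's own query argument only
    # when the configuration does not already carry one, and refuse
    # silently-conflicting inputs otherwise.
    query_config = dict(config.get('query', {}))
    if 'query' in query_config and query is not None:
        raise ValueError("Query must be specified either as an argument "
                         "or inside the configuration, not both")
    query_config.setdefault('query', query)
    return query_config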

Contributor: @necnec I share a similar concern that there can be conflicting settings for the 'useLegacySql' option. One setting could come from the dialect parameter, and the other setting could come from the configuration kwargs. I agree that it may be better to throw an exception if there is a duplicate value for 'useLegacySql' rather than silently ignore the dialect parameter that was specified in read_gbq().

Setting dialect in read_gbq() does not have any effect when 'useLegacySql' is included as part of the configuration. See example below:

from pandas.io import gbq
sql_statement = "SELECT @param1 + @param2 as VALID_RESULT FROM UNNEST([1, 2, 3, 100, 1000])"
config = {
    'query': {
        "useLegacySql": False,
        "parameterMode": "named",
        "queryParameters": [
            {
                "name": "param1",
                "parameterType": {
                    "type": "INTEGER"
                },
                "parameterValue": {
                    "value": 1
                }
            },
            {
                "name": "param2",
                "parameterType": {
                    "type": "INTEGER"
                },
                "parameterValue": {
                    "value": 2
                }
            }
        ]
    }
}
gbq.read_gbq(sql_statement,project_id='xxxxxx', configuration=config, verbose=False, dialect='legacy')
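A minimal sketch of one possible reading of the suggested check, using a hypothetical helper name; it raises instead of silently ignoring the dialect argument when the configuration's useLegacySql contradicts it:

def _check_dialect_conflict(dialect, configuration):
    # Hypothetical helper: raise if the configuration's 'useLegacySql'
    # contradicts the dialect argument instead of silently preferring
    # one of them.
    query_config = (configuration or {}).get('query', {})
    if 'useLegacySql' in query_config:
        if query_config['useLegacySql'] != (dialect == 'legacy'):
            raise ValueError("Conflicting settings: dialect=%r but "
                             "configuration sets useLegacySql=%r"
                             % (dialect, query_config['useLegacySql']))

# With the example above, dialect='legacy' contradicts useLegacySql=False,
# so this check would raise ValueError rather than pick one silently.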

Member: Space after 'kwargs' (before the colon, this is a numpydoc peculiarity)

Member: Also, the ** in the kwargs will give an error when building the docs (as this is used to make something bold).

If you want to keep them, you need to make the docstring a raw string by adding an 'r' at the front (so r""")
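For illustration, one common way to combine both docstring suggestions is to escape the asterisks inside a raw docstring (hypothetical snippet, not the PR's actual docstring):

def read_gbq(query, project_id=None, **kwargs):
    r"""Load data from Google BigQuery.

    Parameters
    ----------
    \*\*kwargs : arbitrary keyword arguments
        Passed through to the underlying query job configuration.
    """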

query_config (dict): query configuration parameters for job processing.
For more information see `BigQuery SQL Reference
Member: Indent this one level less as well (same level as the "For example ..." line).

<https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query>`
Member: No indentation relative to "For more ..." is needed here (otherwise it will possibly give errors when building the docs).


Contributor: can you put the mini example here

.. versionadded:: 0.20.0

Returns
-------
df: DataFrame
@@ -698,7 +709,7 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,
connector = GbqConnector(project_id, reauth=reauth, verbose=verbose,
private_key=private_key,
dialect=dialect)
schema, pages = connector.run_query(query)
schema, pages = connector.run_query(query, **kwargs)
dataframe_list = []
while len(pages) > 0:
page = pages.pop()
49 changes: 48 additions & 1 deletion pandas/io/tests/test_gbq.py
@@ -707,10 +707,57 @@ def test_invalid_option_for_sql_dialect(self):
private_key=_get_private_key_path())

# Test that a correct option for dialect succeeds
# to make sure ValueError was due to invalid dialect
# to make sure ValueError was due to invalid dialect
Contributor: Please revert the extra space before 'dialect' here.

gbq.read_gbq(sql_statement, project_id=_get_project_id(),
dialect='standard', private_key=_get_private_key_path())

def test_query_with_parameters(self):
sql_statement = "SELECT @param1 + @param2 as VALID_RESULT"
query_config = {
"useLegacySql": False,
"parameterMode": "named",
"queryParameters": [
{
"name": "param1",
"parameterType": {
"type": "INTEGER"
},
"parameterValue": {
"value": 1
}
},
{
"name": "param2",
"parameterType": {
"type": "INTEGER"
},
"parameterValue": {
"value": 2
}
}
]
}
# Test that an invalid query without query_config
with tm.assertRaises(ValueError):
Contributor: Why is this necessary? I thought configuration is an optional parameter? When is it needed?

Contributor Author: @jreback Yes, configuration is optional. But this unit test is special: it processes a query with parameters, and in that case you must pass the parameter values in the configuration.

I've made 2 unit tests, so if you think this test is too special I can remove it.

Contributor: It's fine to test. It just seems that this test implies it's required somehow, though.

Contributor Author: @jreback So I don't need to change anything here?

Contributor: Could we update the comment to better explain why we expect a failure here? For example:

# Test that a query that relies on parameters fails when parameters not supplied via configuration.

gbq.read_gbq(sql_statement, project_id=_get_project_id(),
private_key=_get_private_key_path())

# Test that a correct query with query config
Contributor: # Test that the query is successful because we have supplied the correct query parameters via the 'configuration' option

df = gbq.read_gbq(sql_statement, project_id=_get_project_id(),
private_key=_get_private_key_path(),
query_config=query_config)
tm.assert_frame_equal(df, DataFrame({'VALID_RESULT': [3]}))

def test_query_no_cache(self):
query = 'SELECT "PI" as VALID_STRING'
query_config = {
"useQueryCache": False,
}
df = gbq.read_gbq(query, project_id=_get_project_id(),
private_key=_get_private_key_path(),
query_config=query_config)
tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']}))


class TestToGBQIntegration(tm.TestCase):
# Changes to BigQuery table schema may take up to 2 minutes as of May 2015