Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added query_config parameter to read_gbq #14742

Closed
wants to merge 36 commits into from
Closed

Added query_config parameter to read_gbq #14742

wants to merge 36 commits into from

Conversation

necnec
Copy link
Contributor

@necnec necnec commented Nov 25, 2016

Now more complicated queries could be processed.

Now more complicated queries could be processed.
@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

cc @parthea

accepting kwargs is a more general way of doing this. I don't really want to have to keep adding specific keywords.

Futher would need some tests.

@necnec
Copy link
Contributor Author

necnec commented Nov 28, 2016

@jreback thanks for your notes. I've changed parameters to kwargs style.
Could you give me some examples how to make tests and publish it as you mentioned? I've done it on my local but I think it is not enough.

@necnec necnec changed the title Added udf_resource_uri parameter to read_gbq Added query_config parameter to read_gbq Nov 28, 2016
@codecov-io
Copy link

codecov-io commented Nov 28, 2016

Current coverage is 84.75% (diff: 7.69%)

Merging #14742 into master will decrease coverage by 0.02%

@@             master     #14742   diff @@
==========================================
  Files           145        145          
  Lines         51090      51139    +49   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43315      43344    +29   
- Misses         7775       7795    +20   
  Partials          0          0          

Powered by Codecov. Last update e27b296...3a238a5

@necnec
Copy link
Contributor Author

necnec commented Nov 28, 2016

@jreback do I must write some additional test in https://github.com/pandas-dev/pandas/blob/master/pandas/io/tests/test_gbq.py? As I see current version is passed by bot successfully.
If yes, do I need just write test and push it, or I must make these manipulations http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-google-bigquery-integration-tests before?

@parthea
Copy link
Contributor

parthea commented Nov 28, 2016

do I must write some additional test in https://github.com/pandas-dev/pandas/blob/master/pandas/io/tests/test_gbq.py? As I see current version is passed by bot successfully.

Correct. There should be new unit test(s) added in test_gbq.py

If yes, do I need just write test and push it, or I must make these manipulations http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-google-bigquery-integration-tests before?

Please follow the steps in the link mentioned. Ideally this is done before you push. You only need to do this once and hopefully you won't touch it again, except to change credentials if needed. It should only take 5-10 minutes. Separately, please let me know if any part of the instructions is unclear (I put the instructions together).

@necnec
Copy link
Contributor Author

necnec commented Nov 29, 2016

@parthea thank you for your notes.
I've added 2 new tests based on query config.
I'm not sure about test_query_with_parameters because as I see parameters is an experimental feature now.

BQ testing instruction is quite good, although I have problems with credential and my build crashes on my own Travis-CI tests. I've got an exception

InvalidPrivateKeyFormat: Private key is missing or invalid. It should be service account private key JSON (file path or string contents) with at least two keys: 'client_email' and 'private_key'

although I've put my JSON in single quotes

So, is it enough for my pull-request?

@@ -379,6 +379,7 @@ Google BigQuery Enhancements

- The :func:`read_gbq` method has gained the ``dialect`` argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the :ref:`docs <io.bigquery_reader>` for more details (:issue:`13615`).
- The :func:`~DataFrame.to_gbq` method now allows the DataFrame column order to differ from the destination table schema (:issue:`11359`).
- The :func:`read_gbq` method now allows query configuration preferences
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.19.2

@@ -682,6 +686,13 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,

.. versionadded:: 0.19.0

**kwargs: Arbitrary keyword arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just like using a named argument. what I would like to see is something like:

config = {'query' : ....}
read_gbq(.....,configuration=config)

then just add these keys directly.

Copy link
Contributor Author

@necnec necnec Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Hmm, in this case I can use read_gbq for loading files, isn't it? I mean:

config = {'load' : ....}

and function read_gbq could use not only for reading queries and parameter query seems unnecessary. Could you give me some more motivation about configuration parameter?
I think bigquery jobs like load, copy are outside of pandas philosophy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what you mean. the point here is that if someone wants to specify a configuration option, then they can just pass it thru as a a dict structure. I don't want to have to change pandas code again when someone want 'another' option. These should just pass thru.

Copy link
Contributor Author

@necnec necnec Dec 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback sorry for big response but I haven't got your idea yet:

  1. The things I worry about are arguments query and dialect. It is become redunt.
config = {
  'query' : {
    'query':  'select 1',
    'useLegacySql': dialect == 'legacy'
  }
}
read_gbq(query, dialect, configuration=config)

So, I am going to use such logic: if query not specified in config I will use query parameter of read_gbq. But if query or dialect specified in config do I must throw an exception?

  1. As I got your idea configuration parameter should be like
config = {
  'query' : {
    "useQueryCache": False,
  }
}

Not like:

config = {
  "useQueryCache": False
}

And in this case I could pass:

config = {'load' : ....}

what I should do with query parameter of function read_gbq? just skip it inside?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@necnec I share a similar concern that there can be conflicting settings for the 'useLegacySql' option. One setting could come from the dialect parameter, and the other setting could come from the configuration kwargs. I agree that it may be better to throw an exception if there is a duplicate value for 'useLegacySql' rather than silently ignore the dialect parameter that was specified in read_gbq().

Setting dialect in read_gbq() does not have any effect when 'useLegacySql' is included as part of the configuration. See example below:

from pandas.io import gbq
sql_statement = "SELECT @param1 + @param2 as VALID_RESULT FROM UNNEST([1, 2, 3, 100, 1000])"
config = {
    'query': {
        "useLegacySql": False,
        "parameterMode": "named",
        "queryParameters": [
            {
                "name": "param1",
                "parameterType": {
                    "type": "INTEGER"
                },
                "parameterValue": {
                    "value": 1
                }
            },
            {
                "name": "param2",
                "parameterType": {
                    "type": "INTEGER"
                },
                "parameterValue": {
                    "value": 2
                }
            }
        ]
    }
}
gbq.read_gbq(sql_statement,project_id='xxxxxx', configuration=config, verbose=False, dialect='legacy')

@jorisvandenbossche
Copy link
Member

@necnec sorry to ask you to move it again, but as this is not a critical fix or enhancement, let's keep this for 0.20, so can you move the whatsnew notice to v0.20.0.txt?


.. _whatsnew_0200.gbq:

Google BigQuery Enhancements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need a sub-section, just add to enhancements

Google BigQuery Enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- The :func:`read_gbq` method now allows query configuration preferences
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a tiny example, add the issue number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Sorry, but I have some newbie questions
if there is no issue for that feature do I need to create one now? And do I must add tiny example to whatsnew file or to issue file?

@jorisvandenbossche
Copy link
Member

@necnec

@jorisvandenbossche about kwargs instead of configuration you can find there: #14742 (comment)
I think @jreback has better explanation why we use kwargs not configuration

kwargs can be used to pass through a bunch of keywords, but this is a single specific keyword that pandas implements to hold those kwargs, so I think we should document it like that.

Further, the user can now pass other keys in the config object besides 'query', and those will be ignored but no error/warning is raised?

@necnec
Copy link
Contributor Author

necnec commented Dec 22, 2016

@jorisvandenbossche

kwargs can be used to pass through a bunch of keywords, but this is a single specific keyword that pandas implements to hold those kwargs, so I think we should document it like that.

Sorry, I haven't got your idea. Do you think we should use config argument instead of kwargs ? It was my first version but @jreback reasonably ask me to change it to kwargs. I'm ready to add some more documentation if you show me where.

Further, the user can now pass other keys in the config object besides 'query', and those will be ignored but no error/warning is raised?

config allows only 'query' options. Otherwise, it throws an exception. Unit test named test_configuration_without_query check that

@jorisvandenbossche
Copy link
Member

config allows only 'query' options. Otherwise, it throws an exception. Unit test named test_configuration_without_query check that

Yes, but if you pass a config object with 'query' and such a 'copy' key, the 'copy' key will be silently ignored?

@necnec
Copy link
Contributor Author

necnec commented Dec 22, 2016

@jorisvandenbossche Yes it will be ignored. Do you think it should throws an exception in this case?

@jorisvandenbossche
Copy link
Member

@necnec IMO, yes, I would raise an error, that is more informative to the user (or at least raise a warning)

@necnec
Copy link
Contributor Author

necnec commented Dec 30, 2016

@jorisvandenbossche I've added exception if you pass 2 types of job in config. But I haven't added additional unit test.
So after test this feature on my local I receive exception something like this:
ValueError: Only one job type must be specified, but given query,load
I'm not sure that message looks well so I'll happy if you help me to make it more clear.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

@necnec the reason I thought it a nice idea to allow arbitrary kwargs to be passed is that I don't want pandas having to add lots of kwargs to the signature when they are simply passed thru.

does this assertion still hold? is this useful?

@necnec
Copy link
Contributor Author

necnec commented Dec 31, 2016

Yes, I think adding kwargs is a good idea to pass arbitrary keys

@@ -4649,6 +4649,20 @@ destination DataFrame as well as a preferred column order as follows:
index_col='index_column_name',
col_order=['col1', 'col2', 'col3'], projectid)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add starting in 0.20.0 (or you can add a versionadded tag)


You can specify the query config as parameter

.. code-block:: python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say why this is useful as well. If you have a doc-link to things that you might want to pass here, pls add it.

}
}
config = kwargs.get('config')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on what you are doing here (and why)


config = {'query': {'useQueryCache': False}}

For more information see `BigQuery SQL Reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is a good reference, add this above where I indicated

@jreback
Copy link
Contributor

jreback commented Dec 31, 2016

@necnec just a couple of doc comments. ping when green.

config = {'query': {'useQueryCache': False}}

For more information see `BigQuery SQL Reference
<https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no indentation relative to "For more ...) is needed here (otherwise possibly will give errors when building the docs)

@jorisvandenbossche
Copy link
Member

@jreback Repeating my question from above: I know you asked to rename the kwarg from configuration to config, but as this is to specifically pass the 'configuration' from GBQ terminology, shouldn't we rather keep it consistent with their naming and use 'configuration' as well?

@jreback
Copy link
Contributor

jreback commented Jan 2, 2017

I just though config is more in-line with what is actually used in the JSON (and a bit shorter). If we are actually using configuration (to pass the data) then that would be fine too.

@jorisvandenbossche
Copy link
Member

@jreback yes, the key in the dict / json that is passed is actually 'configuration', therefore I would use that.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2017

@necnec ok pls make the adjustment from config -> configuration as @jorisvandenbossche indicates. ping on green.

@jreback jreback closed this in ff3c464 Jan 3, 2017
@jreback
Copy link
Contributor

jreback commented Jan 3, 2017

thanks @necnec

sometimes these things takes time. thanks for the patience.

@necnec
Copy link
Contributor Author

necnec commented Jan 3, 2017

Thank you too for this experience! I'm glad to be participated in it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants