
ENH: Google BigQuery IO Module #4140

Closed
wants to merge 2 commits

Conversation

sean-schaefer

This module adds Google BigQuery functionality to the pandas project. The work is based on the existing sql.py and ga.py, providing methods to query a BigQuery database and read the results into a pandas DataFrame.

        credentials: SignedJwtAssertionCredentials object
        """
        try:
            creds_file = open(self.private_key, 'r')
Member

you should use a context manager here. if the next line fails then the file opened from private_key will not be closed.

@cpcloud
Member

cpcloud commented Jul 5, 2013

Also, is there any way to test this?

@jreback
Contributor

jreback commented Jul 5, 2013

needs docs too (release, whatsnew, io)

@cpcloud FYI ...need ga docs too

since I don't use these, what exactly does this do?

@cpcloud
Member

cpcloud commented Jul 5, 2013

me neither, i created an issue a while ago to make ga docs...i guess low-prio for now.

bigquery is a relational db engine used for querying gigantic datasets. this looks like it puts a pandas wrapper around the google bigquery query language

@cpcloud
Member

cpcloud commented Jul 5, 2013

i would say that based on the experience with the historical finance APIs that we should be very careful about network-ish APIs, insofar as they are difficult to test reliably. just my 2c

@sean-schaefer
Author

We have test cases, but they're rather specific to our own usage. The problem is that the tests require our credentials, as well as having known datasets to compare results with. There are sample datasets provided by Google, though you'll still need an account.

Thanks for the other suggestions, we'll work on these and update when we finish.

@sean-schaefer
Author

We updated the script with some of your error handling suggestions. Could you be more specific about what you need for documentation?

@jreback
Contributor

jreback commented Jul 5, 2013

A usage example in here: http://pandas.pydata.org/pandas-docs/dev/io.html#data-reader (or prob create another section to include ga as well). can you add a code-block to show how to do it (one which doesn't actually execute)?

also a blurb (in enhancements) for release notes (doc/source/release.rst)

@sean-schaefer
Author

Unfortunately, we did not include use cases for ga as well because we are not familiar with it, but we did write documentation for gbq in the io.rst and release.rst files. We also refactored the original script to improve ease of testing and committed a test suite that we've been using. There are several test cases that do require BigQuery credentials, so they will be skipped unless you hardcode those values into the script. The other tests use CSVs that we've included in the appropriate data directory.

Please let us know any other suggestions / requirements you have.

@ghost

ghost commented Jul 12, 2013

Does anyone have more thoughts on this? During testing, we've noticed there are occasionally problems with OSX using Google's Python API (in particular the OpenSSL/PyCrypto modules) - we're investigating, since this will also affect the GA module, but it seems this may be out of our control. Otherwise, it's been fairly stable.

@sean-schaefer
Author

When using our gbq module internally, we found that there are OpenSSL issues across platforms using the authentication method we employed. Although it works well on my local Ubuntu 12.04 system, we've had difficulty getting it to work on Snow Leopard and Windows 7.

There is another form of authentication through oauth2client, but we rejected that option because it cannot run headless, and we plan on running this on an EC2 instance (users are required to grant access to the Google API through a browser window that pops up during execution). However, for use through this library, do you feel it would be better to authenticate in that manner and make it more suitable for cross-platform use? This is a question we are currently considering for internal use as well.

@@ -820,7 +812,7 @@ rows will skip the interveaing rows.
print open('mi.csv').read()
pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)

-Note: The default behavior in 0.12 remains unchanged (``tupleize_cols=True``),
+Note: The default behavior in 0.11.1 remains unchanged (``tupleize_cols=True``),
Contributor

Did you force this to be your doc when rebasing? Looks like a number of version changes...

@ghost

ghost commented Jul 18, 2013

Although not spelled out in our various dev docs, the established norm in pandas is to
avoid placing author names in source code. This applies to everyone from wesm on down.

Contributors large and small so far have accepted git log/blame as suitable credit, and
this can be hereby elevated to the level of a project dictum IMO.

I (and I believe other pandai) feel very strongly about religiously maintaining attribution and giving
proper credit in OSS projects. The accepted form for this is an acknowledgment in the release notes
by name or GH handle. We often do this unbidden when new contributors are involved to show
our gratitude for their contribution to the project.

So, please remove the Authors line from the code and feel free to add a "thanks to @sean-schaefer, Jacob Schaer"
note to the appropriate item in RELEASE.rst file.

@azbones

azbones commented Jul 18, 2013

As far as whether to use OpenSSL or oauth2, I would vote for oauth2. The ga.py submodule uses this currently, many users are using Pandas in iPython, and it is significantly easier to install given it doesn't have all the platform specific OpenSSL dependencies. I spent quite a bit of time trying to make this work during testing and it was quite the hassle when using OpenSSL.

Also, I will volunteer to put the documentation together. I'll just need to research how to do that as I've never contributed documentation before...

@wesm
Member

wesm commented Jul 18, 2013

The test file you added is pretty large. Can you trim it down in size?

raise

# If execution gets here, the file could not be opened
raise IOError
Member

maybe put the comment text in the IOError message
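For example (a sketch assuming a simple open helper, not the module's real code):

```python
# Sketch: carry the explanatory comment in the exception message itself,
# so the user sees why the IOError was raised.
def open_credentials(path):
    try:
        return open(path, 'r')
    except IOError:
        raise IOError("could not open credentials file: %r" % path)
```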

@cpcloud
Member

cpcloud commented Jul 21, 2013

Docs on the other python libraries needed to use this would be great!

@sean-schaefer
Author

We made a few changes as suggested by @cpcloud; thank you for the feedback. We're not sure where you want documentation on library dependencies, but we did add dependencies as a note in io.rst. It should be noted that there is presently a bug in BigQuery that is preventing our module from being 100% reliable, see:

http://stackoverflow.com/questions/17751944/google-bigquery-incomplete-query-replies-on-odd-attempts/

We are planning to implement pagination for the results so that datasets can be much larger and responses more reliable.

@jreback
Contributor

jreback commented Jul 22, 2013

is httplib2 kind of like the requests library?

these deps need to go in: http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies

and you should also have tests that skip if these are not installed

@sean-schaefer
Author

Yes, httplib2 is Google's extension of the original httplib library. I imagine requests could do the same thing, but the examples and recommendations for the BigQuery API were to use httplib2.

We added the third party libraries to the install.rst, and added the import skip tests.

@jreback
Contributor

jreback commented Jul 22, 2013

you might already have this, but on the call to the main api, read_gbq, you need to raise with the deps if you don't find them. generally try the imports at the top of the module and then set a variable to indicate if they are successful, but don't raise (as the api files import your main module); only raise when the user tries to use it

see core/expressions.py for an example of doing this (this actually falls back to using different features, but it's the same idea)

basically a user will try out your function and be like, hey I need these deps....(rather than having pandas auto install them or failing when pandas is imported, after all they are not required)
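That pattern might be sketched like so (module names come from this thread; the real gbq.py may differ):

```python
# Try optional deps once at import time, but only raise when the user
# actually calls the API -- importing pandas itself must never fail
# because an optional dependency is missing.
try:
    import httplib2  # noqa
    _HAVE_HTTPLIB2 = True
except ImportError:
    _HAVE_HTTPLIB2 = False

def read_gbq(query):
    if not _HAVE_HTTPLIB2:
        raise ImportError("read_gbq requires httplib2 "
                          "(pip install httplib2)")
    raise NotImplementedError("query execution elided in this sketch")
```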

@jreback
Contributor

jreback commented Oct 8, 2013

@jtratner @cpcloud any final comments?

@sean-schaefer @jacobschaer i'll rebase this when I put it in...

@jacobschaer
Contributor

@jreback - Sorry, I'm a little slow when it comes to git. I am pretty sure it's the way you want it now. I did separate the docs and ci stuff from the rest though.

@jreback
Contributor

jreback commented Oct 8, 2013

@jacobschaer yes...looks fine now....

@jreback
Contributor

jreback commented Oct 8, 2013

all I need to install is: easy_install bigquery right?

[sheep-jreback-~] bq
Traceback (most recent call last):
  File "/usr/local/bin/bq", line 8, in <module>
    load_entry_point('bigquery==2.0.15', 'console_scripts', 'bq')()
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 318, in load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 2221, in load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1954, in load
  File "build/bdist.linux-x86_64/egg/bq.py", line 39, in <module>
  File "build/bdist.linux-x86_64/egg/bigquery_client.py", line 26, in <module>
ImportError: cannot import name discovery

@jacobschaer
Contributor

@jreback I thought that's all I installed. Per:
http://code.google.com/p/google-bigquery-tools/source/browse/bq/README.txt

easy_install bigquery

What interpreter are you using? This might be related to:
#5116

@jreback
Contributor

jreback commented Oct 9, 2013

got it sorted...had an old version installed.....

bombs away....

@jreback
Contributor

jreback commented Oct 9, 2013

merged via these commits:

390a2d6
2c60400
6ee748e

odd that github didn't close this PR....but in master now

@jreback jreback closed this Oct 9, 2013
@jreback
Contributor

jreback commented Oct 9, 2013

@sean-schaefer @jacobschaer

thanks for all of your nice work on this!

the pandas/big community will be happy!

and you get to support in perpetuity !

check out docs (built tomorrow by 5 est).

also checkout master and test...if anything pls submit a followup PR

@cpcloud
Member

cpcloud commented Oct 9, 2013

one more step toward a pandopoly!

@jreback
Contributor

jreback commented Oct 9, 2013

Is this supposed to be tested under py3?
does it work in py3? I seem to remember you testing / saying it was working?

@jacobschaer
Contributor

@jreback I have not tested under Python 3, and from some tests a while ago it seemed that it can't be supported in this version of bq.py. The issue appears to be related to their handling of unicode.

@cpcloud Soon...

@jreback
Contributor

jreback commented Oct 9, 2013

ok can u test and see if it needs a nice failure message?

@jacobschaer
Contributor

@jreback - I'll see what I can do. I've been using a mac and have been hesitant to do Python 3.

@azbones - Good news, Google is actively working on the 100k result bug. As I understand it, no changes will need to be made to our code as it's entirely backend. We can then uncomment the test I made for this situation.

http://stackoverflow.com/questions/19145587/bq-py-not-paging-results

@jacobschaer
Contributor

@jreback Do we have directions for getting this up and running on Python 3? We kept running into troubles with cython and various dependencies.

@jreback
Contributor

jreback commented Oct 9, 2013

what do you mean, installing py3, pandas, or bq? you are on linux, right?

@jacobschaer
Contributor

I was trying to test on Linux, yes. I had set up an Ubuntu virtual machine, and we got python 3 installed, but were having some problems building our repository.
We did something to the effect of:

apt-get install python3, python3-dev, python-pandas, git, cython
git clone https://github.com/sean-schaefer/pandas.git
python pandas/setup.py develop

@cpcloud
Member

cpcloud commented Oct 9, 2013

u need to be in the pandas directory...then run python setup.py develop

@jacobschaer
Contributor

... sorry, that's what we did. I was just typing this up from memory. We were getting errors like in:
#2439 (comment)

We were also having trouble getting all the python3 versions of things from apt-get.

@cpcloud
Member

cpcloud commented Oct 9, 2013

is there maybe a cython3 that you need instead?

@jreback
Contributor

jreback commented Oct 9, 2013

@jacobschaer I use pip3 for most of the pandas installs (e.g. dateutil,cython)

@jacobschaer
Contributor

On Ubuntu 12.04, we did roughly the following...

sudo apt-get install python3, python3-dev, cython, python3-setuptools
sudo easy_install3 pip
...
pip3-2 install git+https://github.com/sean-schaefer/pandas.git

We eventually got:

Exception: Cython-generated file 'pandas/index.c' not found ...

See:
https://gist.github.com/clayton/c658f4d9e20afc635e35

@jtratner
Contributor

jtratner commented Oct 9, 2013

I believe you can actually compile with Python 2 Cython and use with Py 3

@jreback
Contributor

jreback commented Oct 9, 2013

@sean-schaefer @jacobschaer

docs are built: http://pandas.pydata.org/pandas-docs/dev/io.html#google-bigquery-experimental

  • need a little blurb/example for the v0.13.0 announcements (you can paste here or put in a new PR ....just need a short section in experimental, can be the same example in the docs, this is just to give users a quick taste.)
  • I would remove this part of the docs: "The general structure of this module and its provided functions are based loosely on those in pandas.io.sql."
  • I think you can add a to_gbq method in core/generic.py that just calls gbq.to_gbq(self....)...similar to how the other to_**** methods work (e.g. see to_hdf)
  • pls add the to/read gbq to doc/source/api.rst

all of these (plus any p3k changes) can be rolled into a single PR

thanks
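The to_gbq wrapper suggested above might look like this sketch (NDFrameSketch and _to_gbq_impl are stand-ins, not the real pandas names):

```python
def _to_gbq_impl(frame, destination_table, **kwargs):
    # Stand-in for pandas.io.gbq.to_gbq; records the delegated call
    return (frame, destination_table, kwargs)

class NDFrameSketch(object):
    """Stand-in for pandas.core.generic.NDFrame."""
    def to_gbq(self, destination_table, **kwargs):
        # Thin delegation, the same shape as the other to_**** wrappers
        return _to_gbq_impl(self, destination_table, **kwargs)
```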

@jacobschaer
Contributor

What kind of blurb/example would you want, and what file should it go in? I came up with a moderately interesting example:

query = """SELECT station_number as STATION, month as MONTH, AVG(mean_temp) as MEAN_TEMP FROM publicdata:samples.gsod
WHERE YEAR = 2000 
GROUP BY STATION, MONTH 
ORDER BY STATION, MONTH ASC"""

df = gbq.read_gbq(query)
df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
df3 = pandas.concat([df2.min(), df2.mean(), df2.max()], axis=1, keys=["Min Temp", "Mean Temp", "Max Temp"])

Yields the monthly min, mean, and max US temperatures for the year 2000 using NOAA gsod data.

        Min Temp  Mean Temp    Max Temp
MONTH                                  
1     -53.336667  39.827892   89.770968
2     -49.837500  43.685219   93.437932
3     -77.926087  48.708355   96.099998
4     -82.892858  55.070087   97.317240
5     -92.378261  61.428117  102.042856
6     -77.703334  65.858888  102.900000
7     -87.821428  68.169663  106.510714
8     -89.431999  68.614215  105.500000
9     -86.611112  63.436935  107.142856
10    -78.209677  56.880838   92.103333
11    -50.125000  48.861228   94.996428
12    -50.332258  42.286879   94.396774
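The pivot/concat step can be exercised locally without a BigQuery account (the toy data below is made up, standing in for the read_gbq result):

```python
import pandas as pd

# Made-up stand-in for the STATION/MONTH/MEAN_TEMP rows read_gbq returns
df = pd.DataFrame({
    'STATION': [1, 1, 2, 2],
    'MONTH': [1, 2, 1, 2],
    'MEAN_TEMP': [30.0, 35.0, 50.0, 55.0],
})
df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
df3 = pd.concat([df2.min(), df2.mean(), df2.max()], axis=1,
                keys=["Min Temp", "Mean Temp", "Max Temp"])
```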

As far as a blurb, perhaps:
The gbq module provides a simple way to extract data from, and load data into, Google BigQuery datasets by way of pandas DataFrames. BigQuery is a high performance SQL-like database service, useful for performing ad-hoc queries against extremely large datasets.

I'll make those doc changes and put them in a separate pull request. I still have not been able to successfully test pandas in python 3. I should be able to get the other two changes done pretty quickly today.

@jreback
Contributor

jreback commented Oct 10, 2013

@jacobschaer is that an example for a public/sample dataset? e.g. one that could in theory be reproduced by a user? (those are the best kind!) put a small example in doc/source/v0.13.0.txt in the experimental section (fyi...be sure to pull from master as there were some edits today)

you can put larger examples / doc edits in at your leisure......

@jacobschaer
Contributor

Will do. Those are public sample datasets provided on all GBQ accounts. As long as they have a BigQuery account that has API access (I think accounts not fully set up only get web access), this will take them through the auth process if they haven't already and then work perfectly.

@jacobschaer
Contributor

#5179
