
ENH: Google BigQuery IO Module #4140

Closed
wants to merge 2 commits

Conversation

sean-schaefer

This module adds Google BigQuery functionality to the pandas project. The work is based on the existing sql.py and ga.py, providing methods to query a BigQuery database and read the results into a pandas DataFrame.

        credentials: SignedJwtAssertionCredentials object
        """
        try:
            creds_file = open(self.private_key, 'r')
Member

you should use a context manager here. if the next line fails then the file opened from private_key will not be closed.

@cpcloud
Member

cpcloud commented Jul 5, 2013

Also, is there any way to test this?

@jreback
Contributor

jreback commented Jul 5, 2013

needs docs too (release, whatsnew, io)

@cpcloud FYI ...need ga docs too

since I don't use these, what exactly does this do?

@cpcloud
Member

cpcloud commented Jul 5, 2013

me neither, i created an issue a while ago to make ga docs...i guess low-prio for now.

bigquery is a relational db engine used for querying gigantic datasets. this looks like it puts a pandas wrapper around the google bigquery query language

@cpcloud
Member

cpcloud commented Jul 5, 2013

i would say that based on the experience with the historical finance APIs that we should be very careful about network-ish APIs, insofar as they are difficult to test reliably. just my 2c

@sean-schaefer
Author

We have test cases, but they're rather specific to our own usage. The problem is that the tests require our credentials, as well as having known datasets to compare results with. There are sample datasets provided by Google, though you'll still need an account.

Thanks for the other suggestions, we'll work on these and update when we finish.

@sean-schaefer
Author

We updated the script with some of your error handling suggestions. Could you be more specific about what you need for documentation?

@jreback
Contributor

jreback commented Jul 5, 2013

A usage example in here: http://pandas.pydata.org/pandas-docs/dev/io.html#data-reader (or prob create another section to include ga as well). can you add a code-block to show how to do it (one which doesn't actually execute)?

also a blurb (in enhancements) for release notes (doc/source/release.rst)

@sean-schaefer
Author

Unfortunately, we did not include use cases for ga as well because we are not familiar with it, but we did write documentation for gbq in the io.rst and release.rst files. We also refactored the original script to improve ease of testing and committed a test suite that we've been using. There are several test cases that do require BigQuery credentials, so they will be skipped unless you hardcode those values into the script. The other tests use CSVs that we've included in the appropriate data directory.

Please let us know any other suggestions / requirements you have.

@ghost

ghost commented Jul 12, 2013

Does anyone have more thoughts on this? During testing, we've noticed there are occasionally problems with OSX using Google's Python API (in particular the OpenSSL/PyCrypto modules) - we're investigating, since this will also affect the GA module, but it seems this may be out of our control. Otherwise, it's been fairly stable.

@sean-schaefer
Author

When using our gbq module internally, we found that there are OpenSSL issues across platforms using the authentication method we employed. Although it works well on my local Ubuntu 12.04 system, we've had difficulty getting it to work on Snow Leopard and Windows 7.

There is another form of authentication through oauth2client, but we rejected that option because it cannot run headless, and we plan on running this on an EC2 instance (users are required to grant access to the Google API through a browser window that pops up during execution). However, for use through this library, do you feel it would be better to authenticate in that manner and make it more suitable for cross-platform use? This is a question we are currently considering for internal use as well.

@@ -820,7 +812,7 @@ rows will skip the interveaing rows.
print open('mi.csv').read()
pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)

-Note: The default behavior in 0.12 remains unchanged (``tupleize_cols=True``),
+Note: The default behavior in 0.11.1 remains unchanged (``tupleize_cols=True``),
Contributor

Did you force this to be your doc when rebasing? Looks like a number of version changes...

@ghost

ghost commented Jul 18, 2013

Although not spelled out in our various dev docs, the established norm in pandas is to
avoid placing author names in source code. This applies to everyone from wesm on down.

Contributors large and small so far have accepted git log/blame as suitable credit, and
this can be hereby elevated to the level of a project dictum IMO.

I (and I believe other pandai) feel very strongly about religiously maintaining attribution and giving
proper credit in OSS projects. The accepted form for this is an acknowledgment in the release notes
by name or GH handle. We often do this unbidden when new contributors are involved to show
our gratitude for their contribution to the project.

So, please remove the Authors line from the code and feel free to add a "thanks to @sean-schaefer, Jacob Schaer"
note to the appropriate item in RELEASE.rst file.

@azbones

azbones commented Jul 18, 2013

As far as whether to use OpenSSL or oauth2, I would vote for oauth2. The ga.py submodule uses this currently, many users are using Pandas in iPython, and it is significantly easier to install given it doesn't have all the platform specific OpenSSL dependencies. I spent quite a bit of time trying to make this work during testing and it was quite the hassle when using OpenSSL.

Also, I will volunteer to put the documentation together. I'll just need to research how to do that as I've never contributed documentation before...

@wesm
Member

wesm commented Jul 18, 2013

The test file you added is pretty large. Can you trim it down in size?

raise

# If execution gets here, the file could not be opened
raise IOError
Member

maybe put the comment text in the IOError message
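For example (a sketch assuming a simple open helper, not the module's real code):

```python
# Sketch: carry the explanatory comment in the exception message itself,
# so the user sees why the IOError was raised.
def open_credentials(path):
    try:
        return open(path, 'r')
    except IOError:
        raise IOError("could not open credentials file: %r" % path)
```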

@cpcloud
Member

cpcloud commented Jul 21, 2013

Docs on the other python libraries needed to use this would be great!

@sean-schaefer
Author

We made a few changes as suggested by @cpcloud; thank you for the feedback. We're not sure where you want documentation on library dependencies, but we did add dependencies as a note in io.rst. It should be noted that there is presently a bug in BigQuery that is preventing our module from being 100% reliable, see:

http://stackoverflow.com/questions/17751944/google-bigquery-incomplete-query-replies-on-odd-attempts/

We are planning to implement pagination for the results so that datasets can be much larger and responses more reliable.

@jreback
Contributor

jreback commented Jul 22, 2013

is httplib2 kind of like the requests library?

these deps need to go in: http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies

and you should also have tests that skip if these are not installed

@sean-schaefer
Author

Yes, httplib2 is Google's extension of the original httplib library. I imagine requests could do the same thing, but the examples and recommendations for the BigQuery API were to use httplib2.

We added the third party libraries to the install.rst, and added the import skip tests.

@jreback
Contributor

jreback commented Jul 22, 2013

you might already have this, but on the call to the main api, read_gbq, you need to raise with the deps if you don't find them. generally try the imports at the top of the module and then set a variable to indicate if they are successful, but don't raise (as the api files import your main module); only raise when the user tries to use it

see core/expressions.py for an example of doing this (this actually falls back to using different features, but it's the same idea)

basically a user will try out your function and be like, hey I need these deps....(rather than having pandas auto install them or failing when pandas is imported, after all they are not required)
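That pattern might be sketched like so (module names come from this thread; the real gbq.py may differ):

```python
# Try optional deps once at import time, but only raise when the user
# actually calls the API -- importing pandas itself must never fail
# because an optional dependency is missing.
try:
    import httplib2  # noqa
    _HAVE_HTTPLIB2 = True
except ImportError:
    _HAVE_HTTPLIB2 = False

def read_gbq(query):
    if not _HAVE_HTTPLIB2:
        raise ImportError("read_gbq requires httplib2 "
                          "(pip install httplib2)")
    raise NotImplementedError("query execution elided in this sketch")
```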

@jreback
Contributor

jreback commented Oct 8, 2013

@jtratner @cpcloud any final comments?

@sean-schaefer @jacobschaer i'll rebase this when I put it in...

@jacobschaer
Contributor

@jreback - Sorry, I'm a little slow when it comes to git. I am pretty sure it's the way you want it now. I did separate the docs and ci stuff from the rest though.

@jreback
Contributor

jreback commented Oct 8, 2013

@jacobschaer yes...looks fine now....

@jreback
Contributor

jreback commented Oct 8, 2013

all I need to install is: easy_install bigquery right?

[sheep-jreback-~] bq
Traceback (most recent call last):
  File "/usr/local/bin/bq", line 8, in <module>
    load_entry_point('bigquery==2.0.15', 'console_scripts', 'bq')()
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 318, in load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 2221, in load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1954, in load
  File "build/bdist.linux-x86_64/egg/bq.py", line 39, in <module>
  File "build/bdist.linux-x86_64/egg/bigquery_client.py", line 26, in <module>
ImportError: cannot import name discovery

@jacobschaer
Contributor

@jreback I thought that's all I installed. Per:
http://code.google.com/p/google-bigquery-tools/source/browse/bq/README.txt

easy_install bigquery

What interpreter are you using? This might be related to:
#5116

@jreback
Contributor

jreback commented Oct 9, 2013

got it sorted...had an old version installed.....

bombs away....

@jreback
Contributor

jreback commented Oct 9, 2013

merged via these commits:

390a2d6
2c60400
6ee748e

odd that github didn't close this PR....but in master now

@jreback jreback closed this Oct 9, 2013
@jreback
Contributor

jreback commented Oct 9, 2013

@sean-schaefer @jacobschaer

thanks for all of your nice work on this!

the pandas/big community will be happy!

and you get to support in perpetuity !

check out docs (built tomorrow by 5 est).

also checkout master and test...if anything pls submit a followup PR

@cpcloud
Member

cpcloud commented Oct 9, 2013

one more step toward a pandopoly!

@jreback
Contributor

jreback commented Oct 9, 2013

Is this supposed to be tested under py3?
does it work in py3? I seem to remember you testing / saying it was working?

@jacobschaer
Contributor

@jreback I have not tested under Python 3, and from some tests a while ago it seemed that it can't be supported in this version of bq.py. The issue appears to be related to their handling of unicode.

@cpcloud Soon...

@jreback
Contributor

jreback commented Oct 9, 2013

ok can u test and see if it needs a nice failure message?

@jacobschaer
Contributor

@jreback - I'll see what I can do. I've been using a mac and have been hesitant to do Python 3.

@azbones - Good news, Google is actively working on the 100k result bug. As I understand it, no changes will need to be made to our code as it's entirely backend. We can then uncomment the test I made for this situation.

http://stackoverflow.com/questions/19145587/bq-py-not-paging-results

@jacobschaer
Contributor

@jreback Do we have directions for getting this up and running on Python 3? We kept running into troubles with cython and various dependencies.

@jreback
Contributor

jreback commented Oct 9, 2013

what do you mean, installing py3, pandas, or bq? you are on linux, right?

@jacobschaer
Contributor

I was trying to test on Linux, yes. I had set up an Ubuntu virtual machine, and we got python 3 installed, but were having some problems building our repository.
We did something to the effect of:

apt-get install python3, python3-dev, python-pandas, git, cython
git clone https://github.com/sean-schaefer/pandas.git
python pandas/setup.py develop

@cpcloud
Member

cpcloud commented Oct 9, 2013

u need to be in the pandas directory...then run python setup.py develop

@jacobschaer
Contributor

... sorry, that's what we did. I was just typing this up from memory. We were getting errors like in:
#2439 (comment)

We were also having trouble getting all the python3 versions of things from apt-get.

@cpcloud
Member

cpcloud commented Oct 9, 2013

is there maybe a cython3 that you need instead?

@jreback
Contributor

jreback commented Oct 9, 2013

@jacobschaer I use pip3 for most of the pandas installs (e.g. dateutil,cython)

@jacobschaer
Contributor

On Ubuntu 12.04, we did roughly the following...

sudo apt-get install python3, python3-dev, cython, python3-setuptools
sudo easy_install3 pip
...
pip3-2 install git+https://github.com/sean-schaefer/pandas.git

We eventually got:

Exception: Cython-generated file 'pandas/index.c' not found ...

See:
https://gist.github.com/clayton/c658f4d9e20afc635e35

@jtratner
Contributor

jtratner commented Oct 9, 2013

I believe you can actually compile with Python 2 Cython and use with Py 3

@jreback
Contributor

jreback commented Oct 9, 2013

@sean-schaefer @jacobschaer

docs are built: http://pandas.pydata.org/pandas-docs/dev/io.html#google-bigquery-experimental

  • need a little blurb/example for the v0.13.0 announcements (you can paste here or put in a new PR ....just need a short section in experimental, can be the same example in the docs, this is just to give users a quick taste.)
  • I would remove this part of the docs: "The general structure of this module and its provided functions are based loosely on those in pandas.io.sql."
  • I think you can add a to_gbq method in core/generic.py that just calls gbq.to_gbq(self....)...similar to how the other to_**** methods work (e.g. see to_hdf)
  • pls add the to/read gbq to doc/source/api.rst

all of these (plus any p3k changes) can be rolled into a single PR

thanks
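The to_gbq wrapper suggested above might look like this sketch (NDFrameSketch and _to_gbq_impl are stand-ins, not the real pandas names):

```python
def _to_gbq_impl(frame, destination_table, **kwargs):
    # Stand-in for pandas.io.gbq.to_gbq; records the delegated call
    return (frame, destination_table, kwargs)

class NDFrameSketch(object):
    """Stand-in for pandas.core.generic.NDFrame."""
    def to_gbq(self, destination_table, **kwargs):
        # Thin delegation, the same shape as the other to_**** wrappers
        return _to_gbq_impl(self, destination_table, **kwargs)
```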

@jacobschaer
Contributor

What kind of blurb/example would you want, and what file should it go in? I came up with a moderately interesting example:

query = """SELECT station_number as STATION, month as MONTH, AVG(mean_temp) as MEAN_TEMP FROM publicdata:samples.gsod
WHERE YEAR = 2000 
GROUP BY STATION, MONTH 
ORDER BY STATION, MONTH ASC"""

df = gbq.read_gbq(query)
df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
df3 = pandas.concat([df2.min(), df2.mean(), df2.max()], axis=1, keys=["Min Temp", "Mean Temp", "Max Temp"])

Yields the monthly min, mean, and max US temperatures for the year 2000 using NOAA gsod data.

        Min Temp  Mean Temp    Max Temp
MONTH                                  
1     -53.336667  39.827892   89.770968
2     -49.837500  43.685219   93.437932
3     -77.926087  48.708355   96.099998
4     -82.892858  55.070087   97.317240
5     -92.378261  61.428117  102.042856
6     -77.703334  65.858888  102.900000
7     -87.821428  68.169663  106.510714
8     -89.431999  68.614215  105.500000
9     -86.611112  63.436935  107.142856
10    -78.209677  56.880838   92.103333
11    -50.125000  48.861228   94.996428
12    -50.332258  42.286879   94.396774
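The pivot/concat step can be exercised locally without a BigQuery account (the toy data below is made up, standing in for the read_gbq result):

```python
import pandas as pd

# Made-up stand-in for the STATION/MONTH/MEAN_TEMP rows read_gbq returns
df = pd.DataFrame({
    'STATION': [1, 1, 2, 2],
    'MONTH': [1, 2, 1, 2],
    'MEAN_TEMP': [30.0, 35.0, 50.0, 55.0],
})
df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
df3 = pd.concat([df2.min(), df2.mean(), df2.max()], axis=1,
                keys=["Min Temp", "Mean Temp", "Max Temp"])
```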

As far as a blurb, perhaps:
The gbq module provides a simple way to extract data from, and load data into, Google BigQuery datasets by way of pandas DataFrames. BigQuery is a high performance SQL-like database service, useful for performing ad-hoc queries against extremely large datasets.

I'll make those doc changes and put them in a separate pull request. I still have not been able to successfully test pandas in python 3. I should be able to get the other two changes done pretty quickly today.

@jreback
Contributor

jreback commented Oct 10, 2013

@jacobschaer is that an example for a public/sample dataset? e.g. one that could in theory be reproduced by a user? (those are the best kind!) put a small example in doc/source/v0.13.0.txt in the experimental section (fyi...be sure to pull from master as there were some edits today)

you can put larger examples / doc edits in at your leisure......

@jacobschaer
Contributor

Will do. Those are public sample datasets provided on all GBQ accounts. As long as they have a BigQuery account that has API access (I think accounts not fully set up only get web access), this will take them through the auth process if they haven't already and then work perfectly.

@jacobschaer
Contributor

#5179
