ENH: Google BigQuery IO Module #4140
Conversation
```python
credentials: SignedJwtAssertionCredentials object
"""
try:
    creds_file = open(self.private_key, 'r')
```
you should use a context manager here; if the next line fails then `private_key` will not be closed.

Also, is there any way to test this?
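Something like this, a minimal sketch (the attribute names around the `open` call come from the snippet above; the `SignedJwtAssertionCredentials` arguments are illustrative, not taken from the PR):

```python
from oauth2client.client import SignedJwtAssertionCredentials

# The context manager guarantees the key file is closed even if the
# read or the credential construction below raises.
with open(self.private_key, 'r') as creds_file:
    key = creds_file.read()

# the account/scope attributes here are hypothetical
credentials = SignedJwtAssertionCredentials(self.service_account, key, self.scope)
```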
needs docs too (release, whatsnew, io) @cpcloud FYI ...need ga docs too since I don't use these, what exactly does this do?

me neither, i created an issue a while ago to make ga docs...i guess low-prio for now. bigquery is a relational db engine used for querying gigantic datasets. this looks like it puts a pandas wrapper around the google bigquery query language

i would say that based on the experience with the historical finance APIs that we should be very careful about network-ish APIs, insofar as they are difficult to test reliably. just my 2c

We have test cases, but they're rather specific to our own usage. The problem is that the tests require our credentials, as well as having known datasets to compare results with. There are sample datasets provided by Google, though you'll still need an account. Thanks for the other suggestions, we'll work on these and update when we finish.

We updated the script with some of your error handling suggestions. Could you be more specific about what you need for documentation?
A usage example in here: http://pandas.pydata.org/pandas-docs/dev/io.html#data-reader (or prob create another section to include ga as well). can you add a code-block showing how to do it (one that doesn't actually execute)? also a blurb (in enhancements) for the release notes (doc/source/release.rst)
Unfortunately, we did not include use cases for ga because we are not familiar with it, but we did write documentation for gbq in the io.rst and release.rst files. We also refactored the original script to improve ease of testing and committed a test suite that we've been using. There are several test cases that do require BigQuery credentials, so they will be skipped unless you hardcode those values into the script. The other tests use CSVs that we've included in the appropriate data directory. Please let us know any other suggestions / requirements you have.

Does anyone have more thoughts on this? During testing, we've noticed there are occasionally problems with OSX using Google's Python API (in particular the OpenSSL/PyCrypto modules) - we're investigating, since this will also affect the GA module, but it seems this may be out of our control. Otherwise, it's been fairly stable.

When using our gbq module internally, we found that there are OpenSSL issues across platforms using the authentication method we employed. Although it works well on my local Ubuntu 12.04 system, we've had difficulty getting it to work on Snow Leopard and Windows 7. There is another form of authentication through oauth2client, but we rejected that option because it cannot run headless, as we plan on running this on an EC2 instance (users are required to grant access to the Google API through a browser window that pops up during execution). However, for use through this library, do you feel it would be better to authenticate in that manner and make it more suitable cross-platform? This is a question we are currently considering for internal use as well.
```diff
@@ -820,7 +812,7 @@ rows will skip the intervening rows.
     print open('mi.csv').read()
     pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)

-Note: The default behavior in 0.12 remains unchanged (``tupleize_cols=True``),
+Note: The default behavior in 0.11.1 remains unchanged (``tupleize_cols=True``),
```
Did you force this to be your doc when rebasing? Looks like a number of version changes...
Although not spelled out in our various dev docs, the established norm in pandas is to not put author names in the source files. Contributors large and small so far have accepted this. I (and I believe other pandai) feel very strongly about religiously maintaining attribution and giving credit in the release notes. So, please remove the Authors line from the code and feel free to add a "thanks to @sean-schaefer, Jacob Schaer" to the release notes.

As far as whether to use OpenSSL or oauth2, I would vote for oauth2. The ga.py submodule uses this currently, many users are using pandas in IPython, and it is significantly easier to install given it doesn't have all the platform-specific OpenSSL dependencies. I spent quite a bit of time trying to make this work during testing and it was quite the hassle when using OpenSSL. Also, I will volunteer to put the documentation together. I'll just need to research how to do that as I've never contributed documentation before...

The test file you added is pretty large. Can you trim it down in size?
```python
raise

# If execution gets here, the file could not be opened
raise IOError
```
maybe put the comment text in the IOError message
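i.e. something along these lines (the attribute in the message is assumed from the earlier snippet):

```python
# Make the failure self-describing instead of raising a bare IOError
raise IOError("Could not open credentials file: %s" % self.private_key)
```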
Docs on the other python libraries needed to use this would be great!
We made a few changes as suggested by @cpcloud; thank you for the feedback. We're not sure where you want documentation on library dependencies, but we did add the dependencies as a note in io.rst. It should be noted that there is presently a bug in BigQuery that is preventing our module from being 100% reliable, see:

We are planning to implement pagination for the results so that datasets can be much larger and responses more reliable.
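For reference, paging through results with google-api-python-client against the BigQuery v2 API looks roughly like this (a sketch, not the PR's actual code; `service`, `project_id`, and `job_id` are assumed to already exist):

```python
# Keep requesting pages until the API stops returning a pageToken
rows = []
page_token = None
while True:
    reply = service.jobs().getQueryResults(
        projectId=project_id, jobId=job_id, pageToken=page_token).execute()
    rows.extend(reply.get('rows', []))
    page_token = reply.get('pageToken')
    if not page_token:
        break
```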
these deps need to go in: http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies and you should also have tests that skip if these are not installed
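e.g. the usual nose pattern (the exact import names for the Google client libraries are assumptions):

```python
import nose

def _skip_if_no_bq_deps():
    # Skip, rather than fail, when the optional Google client libs are absent
    try:
        from apiclient.discovery import build  # noqa
        from oauth2client.client import SignedJwtAssertionCredentials  # noqa
    except ImportError:
        raise nose.SkipTest("google-api-python-client/oauth2client not installed")
```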
Yes, we added the third-party libraries to the optional dependencies documentation.
you might already have this, but on the call to the main api, check for the deps: basically a user will try out your function and be like, hey I need these deps.... (rather than having pandas auto install them or failing when pandas is imported; after all, they are not required)
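i.e. defer the import to the entry point so a missing dependency produces a clear message only when the function is actually used (a sketch; the real dependency list and message wording may differ):

```python
def read_gbq(query, project_id=None):
    # Importing here (not at pandas import time) keeps the deps optional
    try:
        from apiclient.discovery import build  # noqa
    except ImportError:
        raise ImportError("Google BigQuery support requires "
                          "google-api-python-client; try "
                          "'pip install google-api-python-client'")
    # ... actual query logic would follow here
```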
@jtratner @cpcloud any final comments? @sean-schaefer @jacobschaer i'll rebase this when I put it in...

@jreback - Sorry, I'm a little slow when it comes to git. I am pretty sure it's the way you want it now. I did separate the docs and ci stuff from the rest, though.

@jacobschaer yes...looks fine now....
all I need to install is:
@jreback I thought that's all I installed. Per:
What interpreter are you using? This might be related to:
got it sorted...had an old version installed..... bombs away....

thanks for all of your nice work on this! the pandas/big data community will be happy! and you get to support it in perpetuity! check out the docs (built tomorrow by 5 EST). also check out master and test...if anything comes up pls submit a followup PR

one more step toward a pandopoly!
Is this supposed to be tested under py3?

ok can u test and see if it needs a nice failure message?
@jreback - I'll see what I can do. I've been using a mac and have been hesitant to do Python 3. @azbones - Good news, Google is actively working on the 100k result bug. As I understand it, no changes will need to be made to our code as it's entirely backend. We can then uncomment the test I made for this situation. http://stackoverflow.com/questions/19145587/bq-py-not-paging-results

@jreback Do we have directions for getting this up and running on Python 3? We kept running into troubles with cython and various dependencies.
what do you mean, installing py3, pandas? or bq? py3? you are on linux, right?
I was trying to test on Linux, yes. I had set up an Ubuntu virtual machine, and we got Python 3 installed, but we were having some problems building our repository.
u need to be in the pandas source directory when you build
... sorry, that's what we did. I was just typing this up from memory. We were getting errors like in:

We were also having trouble getting all the python3 versions of things from apt-get.
is there maybe a cython3 that you need instead?
@jacobschaer I use |
On Ubuntu 12.04, we did roughly the following...

We eventually got:
I believe you can actually compile with Python 2 Cython and use with Py 3
docs are built: http://pandas.pydata.org/pandas-docs/dev/io.html#google-bigquery-experimental
all of these (plus any py3k changes) can be rolled into a single PR. thanks
What kind of blurb/example would you want, and what file should it go in? I came up with a moderately interesting example:
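Roughly along these lines (a reconstruction from the description below; the exact query and the `read_gbq` signature here are assumptions, not the comment's original code):

```python
from pandas.io import gbq

# Monthly min/mean/max US temperatures for 2000, from the public
# NOAA GSOD sample dataset that ships with every BigQuery account
query = """
SELECT month,
       MIN(mean_temp) AS min_temp,
       AVG(mean_temp) AS avg_temp,
       MAX(mean_temp) AS max_temp
FROM [publicdata:samples.gsod]
WHERE year = 2000
GROUP BY month
ORDER BY month
"""

df = gbq.read_gbq(query, project_id='your-project-id')  # project id assumed
```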
It yields the monthly min, mean, and max US temperatures for the year 2000 using NOAA GSOD data.
As far as a blurb, perhaps:

I'll make those doc changes and put them in a separate pull request. I still have not been able to successfully test pandas in python 3. I should be able to get the other two changes done pretty quickly today.
@jacobschaer is that an example for a public/sample dataset? e.g. one that could in theory be reproduced by a user? (those are the best kind!) put a small example in doc/source/v0.13.0.txt in the experimental section (fyi...be sure to pull from master as there were some edits today). you can put larger examples in / edit the docs at your leisure......
Will do. Those are public sample datasets provided on all GBQ accounts. As long as they have a BigQuery account that has API access (I think accounts not fully set up only get web access), this will take them through the auth process if they haven't already and then work perfectly.
This module adds Google BigQuery functionality to the pandas project. The work is based on the existing sql.py and ga.py, providing methods to query a BigQuery database and read the results into a pandas DataFrame.