BigQuery: 'RowIterator.to_dataframe' surprisingly consumes / merges all pages. #7293
Comments
@tswast I don't know whether this is just a docs issue (clarify that …)
That whole https://cloud.google.com/bigquery/docs/paging-results page is pretty irrelevant when using the client library. The … What is the reason that you'd want a dataframe for only a single page? Potentially this could be a feature request to provide a helper on the page property of the RowIterator to construct a dataframe, if there is a reason this is needed.
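For illustration, a minimal sketch of such a per-page helper built on the existing pages property (the table ID is a placeholder, and the per-page DataFrame construction is my assumption, not an API the library provides as a single call):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
rows = client.list_rows("project.dataset.table", page_size=10000)

# `pages` is inherited from api_core's HTTPIterator: each page is fetched
# with a single API request and holds only that page's rows in memory.
for page in rows.pages:
    df = pd.DataFrame([dict(row) for row in page])
    # ... work with one page-sized DataFrame at a time ...
```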
I couldn't find any documentation about how the RowIterator takes care of pagination, at least nothing beyond using the next_page_token at the time of writing this issue. I have since found the documentation about it. With that in mind, this really is just a documentation problem. However, I'll describe what I was trying to do, and if you think that it falls in line with good practice, we can open a feature request (I am also willing to make the pull request myself, but might need some help since I have not contributed to this repo before).

What I was doing

We have a small dataset in the cloud which is normally processed with C, and we have some tools which wrap that C library in Python 3. We need to re-process this data for an analysis, but since Python 3 is not yet supported by Dataflow, the plan was to page through the table and upload results one page at a time, since we will be doing it locally on a RAM-limited machine. I tried to load a page and wanted to process just that one page, for the sake of keeping memory free for the analysis. Working with a dataframe would be ideal, since some of our methods accept dataframes as input. But when using the to_dataframe method, we ended up with a lot more memory in use, and when increasing the amount of data imported by this tool, we would not have enough memory left to do anything with it. Maybe this is worth opening the feature request that @tswast suggested. However, for now:

Documentation fixes

I think we need to …
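As a sketch of the workflow described above (table IDs and the reprocess step are placeholders of mine; load_table_from_dataframe additionally requires pyarrow or fastparquet to be installed):

```python
import pandas as pd
from google.cloud import bigquery

def reprocess(df):
    # Stand-in for the real C-backed processing step.
    return df

client = bigquery.Client()
rows = client.list_rows("project.dataset.source_table", page_size=5000)

for page in rows.pages:
    chunk = pd.DataFrame([dict(row) for row in page])
    result = reprocess(chunk)
    # Upload this page's results before fetching the next page, so peak
    # memory stays near one page plus its processed output.
    client.load_table_from_dataframe(result, "project.dataset.results").result()
```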
The most needed change is in … As for the generated documentation, I am unsure how to add a hyperlink, or what the best solution is. Shouldn't the documentation tooling automatically find HTTPIterator and let us click on the name to go to it, since it belongs to the same repo?
Sphinx doesn't show inheritance by default. I believe we'd need to add …
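For reference (an assumption on my part, not something confirmed in this thread), the Sphinx autodoc options that surface inherited members and the base class look like this in the .rst source:

```rst
.. autoclass:: google.cloud.bigquery.table.RowIterator
   :members:
   :inherited-members:
   :show-inheritance:
```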
Hi, I just wanted to check whether I understood correctly: when iterating over an instance of RowIterator, is an API request sent to BQ for each row (element) in the iterator? I'm wondering how much data is loaded into application memory. In the first case, would the first iteration load only one row, and in the second case would the application load the entire batch? Thank you!
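To make the paging behaviour concrete, a small sketch (the table ID is a placeholder): iterating row by row does not mean one API request per row.

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.list_rows("project.dataset.table", page_size=500)

# The iterator requests one page (up to page_size rows) per API call and
# then yields rows from that in-memory page until it is exhausted, so at
# most roughly one page of rows is held for the iteration itself.
for row in rows:
    print(row)
```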
Environment details
Python version: 3.7.2
Virtual environment: Conda-managed
pip freeze | grep google
Problem
The strategy for paginating through a table in BigQuery with RowIterator.to_dataframe() does not work as expected.
Either …
Steps to reproduce
Use the Client.list_rows() method.
Code example
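A minimal reconstruction of the reported behaviour (the table ID is a placeholder; this is not the reporter's original snippet):

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.list_rows("project.dataset.table", page_size=100)

# Expected: a DataFrame covering roughly one page of rows.
# Observed: to_dataframe() keeps fetching pages until the iterator is
# exhausted, so the entire table ends up in memory.
df = rows.to_dataframe()
print(len(df))  # full row count, not ~page_size
```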