Provide support for synchronous queries through the v2/projects/{projectId}/queries endpoint #589
@kdeggelman I'm not sure it would be feasible to change how that endpoint behaves on the backend. Even when POST-ing to it directly, it just creates a new job to run the actual query, and the code would still need to fetch the query results using the job ID obtained from the first call's response. As an alternative, have you considered using the faster BigQuery Storage API? It is more performant than the REST API, allows streaming the rows over multiple streams, and so on. Could that help with your use case?
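For reference, a minimal sketch of reading rows with the BigQuery Storage API, assuming the google-cloud-bigquery-storage package is installed; the project, dataset, and table names are placeholders, and for query results you would point it at the query's destination table:

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Fully qualified table to read (placeholder values).
table = "projects/my-project/datasets/my_dataset/tables/my_table"

session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table=table,
        data_format=types.DataFormat.AVRO,  # Avro decoding requires fastavro
    ),
    max_stream_count=1,
)

# Streams can be read in parallel; this reads the single requested stream.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
```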
Thanks for the help @plamut!
The doc you linked to for
The Storage API definitely looks promising, as I'm sure we'll run into cases where our queries go beyond the timeout of the queries endpoint. That being said, I think it could still make sense to support the queries endpoint.
This is a tricky one.
We got partway through implementing this in #362 before reverting a few of those PRs. The other problem we encountered is that to do this right, we'd have to cache that first page of results, which was unexpected and undesired for a large class of use cases: #394. That said, I think we could support this ask, but only if it's behind a feature flag. Perhaps we add an api_method option.
@tswast thanks for providing all those links for context. It's clear that there is no one-size-fits-all answer to the problem of optimally retrieving query results. I think adding some sort of explicit flag to use the synchronous endpoint makes sense.

In the meantime, I'm pretty happy with my workaround:

```python
from google.cloud.bigquery.query import _QueryResults


def run_query_synchronously(bq_client, query):
    # Hypothetical wrapper name; the original snippet assumed an enclosing
    # function. POSTs directly to the jobs.query endpoint instead of creating
    # a job and polling it. Note: _connection and _QueryResults are private.
    synchronous_api_response = bq_client._connection.api_request(
        "POST",
        f"/projects/{bq_client.project}/queries",
        data={"query": query, "useLegacySql": False},
    )
    return _QueryResults.from_api_repr(synchronous_api_response).rows
```
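A hypothetical usage of the wrapper above (project ID and query are placeholders):

```python
from google.cloud import bigquery

bq_client = bigquery.Client(project="my-project")  # placeholder project
rows = run_query_synchronously(bq_client, "SELECT 1 AS x")
for row in rows:
    print(row)
```

Note that this leans on private APIs (_connection, _QueryResults), so it may break between library releases.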
@tswast Could you elaborate on how you imagine this addition would look from the user's perspective?

Also asking because, if the existing method is updated, we need to think about what to do with parameters such as timeout and retry.
@plamut I have a sketch of a plan in #362
Very true! I'd still imagine returning a QueryJob. Whether the job is finished or not, we'd return a QueryJob constructed from the query response.
These are not irrelevant. Many (but not all, unfortunately) of the parameters in QueryJobConfig have equivalents in the query request.

As you can see from https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#queryrequest, the request object to jobs.query supports many of the same options as a job configuration.
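To illustrate the overlap, a hedged sketch of a jobs.query request body, using field names from the linked REST reference; the comments map them to the roughly corresponding QueryJobConfig options, and the query itself is a placeholder:

```python
# A plain dict mirroring the QueryRequest schema accepted by jobs.query.
query_request = {
    "query": "SELECT 1 AS x",            # placeholder query
    "useLegacySql": False,               # QueryJobConfig.use_legacy_sql
    "useQueryCache": True,               # QueryJobConfig.use_query_cache
    "maximumBytesBilled": "1000000000",  # QueryJobConfig.maximum_bytes_billed
    "dryRun": False,                     # QueryJobConfig.dry_run
    "maxResults": 100,                   # size of the first page in the response
    "timeoutMs": 10000,                  # how long jobs.query waits before returning
}
```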
Oh, and the other thing we'll require is the "first page" hack that I initially implemented in RowIterator. We might need some additional hooks there, because we won't have the destination table ID available.
This is 1/2 done. That said, it's missing all the optimizations to make it useful: it still calls getQueryResults to fetch the rows after the initial request.
This issue has been open for three years, and while some effort has been made, there appears to be a significant amount of work remaining. Besides the initial request three years ago, there does not appear to be a large user base requesting this feature. Closing due to workload and other priorities.
#1723 finished up the work planned for this project. With that change, when you run client.query with api_method="QUERY", the first page of rows from the query response is used directly.

Note: it still does a call to the jobs.get API to fetch the job metadata.
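A minimal sketch of the new path, assuming a release that includes #1723; the client project and query are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# api_method="QUERY" issues jobs.query, so the first page of rows arrives
# with the query response instead of a later getQueryResults call.
job = client.query("SELECT 1 AS x", api_method="QUERY")
for row in job.result():
    print(row.x)
```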
@tswast thank you for working on this! I am reading your PR and wanted to clarify the usage, as we're working to reduce the latency between query creation time and serving of results from (typically) single-page queries. Our usage at the moment looks roughly like:

```python
job = client.query(query=sql, job_config=job_config)
return [result for result in job.result(
    start_index=start_index, max_results=limit, retry=retry, timeout=timeout)
]
```

My understanding is we will need to update our client.query call to take advantage of this.
@nickmarx12345678 Yes, to avoid an extra call to getQueryResults, use api_method="QUERY".

Also note: my in-flight PR #1722 optimized this further by introducing a query_and_wait method that can avoid an unnecessary call to get the job metadata as well.
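For illustration, a minimal sketch of the query_and_wait path from PR #1722, assuming a client library version that includes it; the project ID and query are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# query_and_wait returns a RowIterator directly, skipping the separate
# job-metadata fetch when the query finishes within the server-side wait.
rows = client.query_and_wait("SELECT 1 AS x")
for row in rows:
    print(row.x)
```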
Thanks! We have already started testing with the prerelease version and can see the difference in traces. One observation I wanted to share: it appears that in the optimized case, the resulting row iterator may have its total_rows attribute unset, e.g.:

```python
job = client.query(query=sql, job_config=job_config, api_method="QUERY")
result = job.result(start_index=start_index, max_results=limit, retry=retry, timeout=timeout)
rows = [row for row in result]
print(result.total_rows)
# None
```
@nickmarx12345678 I noticed that issue too in https://github.com/googleapis/python-bigquery/pull/1722/files#diff-1172a63bba9d23e3a24e43736ef1275452ef2ef2d4a50d054c209bdd2792fdf5 and have a fix included in that PR. I can pull that fix out into a separate PR so it's not blocked, if you prefer.
#1748 is the fix for total_rows.
Is your feature request related to a problem? Please describe.
We are part of the BI Engine preview, so we're trying to minimize the latency between our Python application and BigQuery. The current approach of creating an asynchronous job and then waiting for it to complete adds significant latency and hurts the value proposition of BI Engine.
Describe the solution you'd like
We'd like a way to utilize the v2/projects/{projectId}/queries endpoint to execute a query and wait for the response.

Describe alternatives you've considered
We could call the REST API directly. The main drawbacks are: 1) handling authentication with a service account ourselves, and 2) the existing Python client does a nice job of creating Row objects with the results. I've also used the snippet shown in my workaround comment above.