Support for job queue and background execution of queries #231
I will also state explicitly that this is intended to complement the current query processing pipeline and not replace it at this time. It simply adds asynchronous processing to the architecture, which allows for more control over execution and added flexibility using the service-based approach.
Is there a way in Python-RQ to cancel a task? If there is one, we would need a suitable pre-cancellation hook to be able to run database-specific query cancellation code. If there isn't, we can do the query cancellation outside of Python-RQ. In any case, for PostgreSQL we will want to be able to get the query backend's pid from the RQ task so we can invoke pg_cancel_backend(pid). psycopg2 allows you to get this pid via the connection's get_backend_pid() method.
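For PostgreSQL, a minimal sketch of that cancellation path with psycopg2 (the connection parameters are placeholders, and the bookkeeping around the worker/task is assumed):

import psycopg2

# Connection executing the long-running query (e.g. held by the task)
conn = psycopg2.connect('dbname=harvest')
backend_pid = conn.get_backend_pid()

# A separate connection can ask the server to cancel that backend's query
admin = psycopg2.connect('dbname=harvest')
cur = admin.cursor()
cur.execute('SELECT pg_cancel_backend(%s)', (backend_pid,))
sent = cur.fetchone()[0]  # True if the cancellation signal was sent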
I definitely agree with the above approach. My one major question is the storing of the results. Let's say I request a long query so I get the UUID back and I can monitor the job. All of this makes sense and is well outlined above. Additionally, let's say that not only will this query be long running but it will produce a large number of results, so the resulting data will be quite large. So, once the job finishes, I am able to retrieve the results and everything is fine. Now, let's say a couple days later, I want those results again. Are they still being stored and linked to the job ID or are they purged after they are retrieved the first time? I am just unclear as to when results will be purged and how and where exactly they will be stored. I just think we need to anticipate not only long jobs but also jobs that generate a significant amount of data. I don't think we can adopt the approach of keeping all job results forever, so clearly deciding on a purging mechanism (rolling files, time-based expiring DB entries, etc.) will be important and I think should be outlined here before work begins. Other than that, I like the approach above and think it all makes sense.
I was not intending to use Python-RQ since that is limited to executing Python code. I want to try a service-based approach (over HTTP) that establishes a standard way of executing (GET/POST) the task and optionally canceling the task (DELETE) (for tasks that can be canceled). The advantages of this approach:
What this means in practice is that the actual database execution is being done via this service interaction, for example:
Obviously this would be bare bones, assuming the connection/credentials are already known by the service. The idea is that Serrano would define a "service-compatible" endpoint for executing the query and it would be a consumer of itself. It would queue a task to send a request to itself and the response would be sent to the result store. As mentioned above, the service would need to support GET/POST to "execute the task", e.g. along the lines of the sketch below.
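To illustrate that interaction from the consumer side, a rough sketch (the base URL, endpoint path, and payload keys are hypothetical, not a defined Serrano API):

import requests

base = 'http://localhost:8000/tasks'
uuid = 'a1b2c3d4'  # identifier assigned when the task was queued

# Execute the task: POST the statement and parameters to the task endpoint
resp = requests.post('%s/%s/' % (base, uuid), json={
    'statement': 'SELECT name FROM patient WHERE age > %s',
    'parameters': [30],
})
rows = resp.json()

# Cancel the task (only for tasks that support cancellation)
requests.delete('%s/%s/' % (base, uuid))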
@naegelyd I thought about that too. It could be a task option that prevents it from being purged, such as with query results that may be used across sessions. Of course this only works as long as the results are valid; otherwise, the data would have to be purged like the application cache.
@murphyke To give a concrete example, I envision a barebones service to look like this (assume this is a single process and thread-safe):

import json

import psycopg2
from flask import Flask, abort, request

app = Flask(__name__)

# Map of task UUID -> open connection so a running query can be canceled
tasks = {}

@app.route('/<uuid>/', methods=['POST'])
def run(uuid):
    conn = psycopg2.connect(...)  # connection/credentials already known by the service
    tasks[uuid] = conn
    data = request.json
    try:
        c = conn.cursor()
        c.execute(data['statement'], data['parameters'])
        # Serialize the rows (see the note on JSON vs. pickle below)
        return json.dumps(c.fetchall())
    except Exception:
        conn.cancel()
        abort(500)
    finally:
        tasks.pop(uuid, None)

@app.route('/<uuid>/', methods=['DELETE'])
def cancel(uuid):
    if uuid not in tasks:
        abort(404)
    tasks.pop(uuid).cancel()
    return '', 204
Also assume the data being returned is serialized into something such as a JSON array, or pickled if Python will be the only consumer.
@bruth Yeah, I saw Redis and workers in the diagram and thought you might be thinking of layering on top of RQ (or another existing task queuing system). So does this have to be part of Serrano at all? It sounds like a generic task queuing system with a REST API. Is there an existing component out there that we could use?
That is what I am investigating now. At a minimum, Serrano would merely be a light wrapper to make it easier, from Cilantro's perspective, to interact with the tasks Harvest is queuing and consuming.
@murphyke Well, the task queue doesn't have a REST API; it uses HTTP as the protocol for communicating with services.
@murphyke I just realized what you meant re: it being a REST interface from Avocado's standpoint. It basically needs to be an HTTP proxy + named queues + response storage.
I'm going to attempt some sort of implementation of HTQ in the Harvest stack, using the CBTTC Harvest application as a model.
I am working on the query canceling part: #280
I just observed that canceling an export after the formatter processing has begun does not stop the processing. This is another reason why queries need to be executed in a separate thread or process so they can be managed.
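A minimal sketch of that idea, assuming one thread per query; psycopg2 allows the connection's cancel() method to be called from a different thread to interrupt the running statement (the connection parameters and query are placeholders):

import threading

import psycopg2

conn = psycopg2.connect('dbname=harvest')

def run_query(statement, parameters):
    # Raises an exception in this thread if the query is canceled
    cur = conn.cursor()
    cur.execute(statement, parameters)
    return cur.fetchall()

worker = threading.Thread(target=run_query,
                          args=('SELECT pg_sleep(%s)', (600,)))
worker.start()

# ...later, e.g. when the user cancels the export...
conn.cancel()  # interrupts the statement running in the worker thread
worker.join()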
It has been many months since this issue was originally written, so I want to clarify and restate its intent. The outcome of this feature is to provide more control over queries being executed and the results they produce. Control is defined conceptually as:
The motivation for this feature is the ability for users to execute a query and access the results at a later time. In Cilantro, this could be implemented in two contexts:
In either case, the requirements are the same:
We have persistent queries in the form of a saved query. The current thought for storing the results is to simply cache them (#109) using the current cache APIs (a sketch of this follows below). This is likely to be good enough until there is more evidence of how this feature is being used. With these base features already in place, this issue is distilled down to keeping track of all planned, running, and cached ad-hoc queries. The API should support:
Another way to think about it is that an administrative page of the running queries and their properties can be displayed, with the ability to click through to the results or cancel a query if it is still running.
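For the cache-based result storage mentioned above, a minimal sketch using Django's cache framework with a time-based expiration (the key prefix and timeout are arbitrary placeholders, not a decided convention):

from django.core.cache import cache

def store_results(job_uuid, rows, timeout=60 * 60 * 24):
    # Entries expire automatically after `timeout` seconds
    cache.set('query-results:%s' % job_uuid, rows, timeout)

def get_results(job_uuid):
    # Returns None if the results were never stored or have expired
    return cache.get('query-results:%s' % job_uuid)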
The scale of some Harvest applications has reached the point where queries are taking longer than what is appropriate to classify as "real-time ad-hoc queries". This is still the target use case; however, some applications do not necessarily require the real-time aspect but still want the power of the query generation.
The current query execution pipeline does not take long-running queries into consideration; queries are naively executed in the main thread of the program, which blocks until results are returned. This is problematic for two reasons:
This leaves the application and database in a potentially unknown/broken state, which ultimately impacts the user. A common response to a canceled query is to try it again, which could compound the issue if the previous query was not canceled at the database level.
The high-level approach is to:
Task states: queued, starting, running, finished, canceled, error, expired
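As an illustration only (these names are hypothetical, not an implemented API), the states could be tracked as simple constants with the terminal ones grouped:

QUEUED, STARTING, RUNNING, FINISHED, CANCELED, ERROR, EXPIRED = (
    'queued', 'starting', 'running', 'finished', 'canceled', 'error', 'expired')

# States a task can never leave once reached
TERMINAL_STATES = (FINISHED, CANCELED, ERROR, EXPIRED)

def can_cancel(state):
    # Only tasks that have not yet completed can be canceled
    return state not in TERMINAL_STATES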
Technical considerations: