
Support for job queue and background execution of queries #231

Closed
Tracked by #41
bruth opened this issue Sep 24, 2014 · 16 comments

Comments

@bruth
Contributor

bruth commented Sep 24, 2014

The scale of some Harvest applications has reached the point that queries are taking longer than what is appropriate to classify as "real-time ad hoc queries". Real-time querying is still the target use case; however, some applications do not strictly require the real-time aspect but still want the power of the query generation.

The current query execution pipeline does not take long-running queries into consideration: queries are naively executed in the main thread of the program, which blocks until results are returned. This is problematic for two reasons:

  • No other operations can be performed in the main thread
  • The only method for canceling the query is signaling an interrupt (Ctrl+C) on the process (assuming an interactive session) or terminating the process by external means

This leaves the application and database in a potentially unknown or broken state, which ultimately impacts the user. A common response to canceling a query is to try it again, which could compound the issue if the previous query was not canceled at the database level.

The high-level approach is to:

  • Introduce the notion of a "job" that corresponds to some external processing such as a database query, analysis pipeline, etc.
  • Focus on user-defined query jobs first since they are the current issue and are much more difficult to predict (due to their arbitrary nature)
  • Define various states which the main application can use to report on the status of the query (see the sketch after these lists)
    • e.g. queued, starting, running, finished, canceled, error, expired
  • Add mechanisms for spawning a thread or process that will dequeue and handle the lifecycle of the job (updating its state as it is processed)
  • Provide an API for accessing a job and getting its state

Technical considerations:

  • The job API must support canceling/killing the job
    • This would terminate the thread/process, but also propagate to external systems such as canceling a database query
    • This prevents the stampede effect and keeps the systems clean of rogue state
  • The output of the job (i.e. results from the database) must be stored somewhere
    • The data is no longer being produced/streamed into the main thread and therefore is not accessible. Possible solutions include using:
      • Shared data structures between threads
      • Cross-process communication
      • External system for temporary storage (this feels safest and most flexible for future growth)
    • All data must have an expiration so the interim storage does not balloon
  • Any performance loss of this process should not be noticeable to typical Harvest projects
    • Ideally this would be the only pipeline; however, if needed there could be an expedited path which skips the queue and executes the query immediately
  • Although it theoretically makes sense to implement this in Python, it may not need to be if the abstraction is well defined
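
To make the job and state ideas above concrete, here is a minimal sketch of what a job record could look like. The class, field, and method names are illustrative only and not part of any existing Harvest API:

import uuid


class JobState(object):
    # Possible states, mirroring the list above.
    QUEUED = 'queued'
    STARTING = 'starting'
    RUNNING = 'running'
    FINISHED = 'finished'
    CANCELED = 'canceled'
    ERROR = 'error'
    EXPIRED = 'expired'


class Job(object):
    "Tracks a unit of external processing, e.g. a database query."

    def __init__(self, payload):
        self.id = str(uuid.uuid4())
        self.payload = payload        # e.g. the statement and parameters
        self.state = JobState.QUEUED
        self.result_key = None        # where the output is stored, if any

    def transition(self, state):
        # A real implementation would validate allowed transitions and
        # persist the change so the main application can report on it.
        self.state = state
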
@bruth bruth changed the title from "Support for background execute and job queue of queries" to "Support for job queue and background execution of queries" on Sep 24, 2014
@bruth
Contributor Author

bruth commented Sep 25, 2014

In my opinion, the biggest decision in all of this is how exactly the jobs will get run. The diagram below illustrates a service-based approach, which feels the most flexible in the long term:

[Diagram: harvest-background-queries]

@bruth
Contributor Author

bruth commented Sep 25, 2014

I will also state explicitly that this is intended to complement the current query processing pipeline and not replace it at this time. It simply adds asynchronous processing to the architecture, which allows for more control over execution and added flexibility using the service-based approach.

@bruth bruth self-assigned this Sep 25, 2014
@murphyke
Member

Is there a way in Python-RQ to cancel a task? If there is one, we would need a suitable pre-cancellation hook to be able to run database-specific query cancellation code. If there isn't, we can do the query cancellation outside of Python-RQ. In any case, for PostgreSQL we will want to be able to get the query backend's pid from the RQ task so we can invoke pg_cancel_backend(pid). psycopg2 allows you to get this pid using connection.get_backend_pid().
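
As a rough sketch of that cancellation path (independent of any particular queue; the DSN handling and function names here are placeholders):

import psycopg2

def start_query(dsn, statement, params):
    # Connection that runs the long query; record the PostgreSQL
    # backend pid so a separate connection can cancel it later.
    conn = psycopg2.connect(dsn)
    pid = conn.get_backend_pid()
    # ... store `pid` alongside the task so the cancel path can find it ...
    cur = conn.cursor()
    cur.execute(statement, params)
    return cur.fetchall()

def cancel_query(dsn, pid):
    # Issued from a *separate* connection; pg_cancel_backend asks the
    # backend running the query to abort its current statement.
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    cur.execute('SELECT pg_cancel_backend(%s)', (pid,))
    conn.close()
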

@naegelyd
Collaborator

I definitely agree with the above approach. My one major question is the storing of the results. Let's say I request a long query so I get the UUID back and I can monitor the job. All of this makes sense and is well outlined above. Additionally, let's say that not only will this query be long running but it will produce a large number of results so the resulting data will be quite large. So, once the job finishes, I am able to retrieve the results and everything is fine. Now, let's say a couple days later, I want those results again. Are they still being stored and linked to the job ID or are they purged after they are retrieved the first time?

I am just unclear as to when results will be purged and how and where exactly they will be stored. I just think we need to anticipate not only long jobs but also jobs that generate a significant amount of data. I don't think we can adopt the approach of keeping all job results forever, so clearly deciding on a purging mechanism (rolling files, time-based expiring DB entries, etc.) will be important and I think should be outlined here before work begins. Other than that, I like the approach above and think it all makes sense.

@bruth
Contributor Author

bruth commented Sep 25, 2014

I was not intending to use Python-RQ since that is limited to executing Python code. I want to try a service-based approach (over HTTP) that establishes a standard way of executing a task (GET/POST) and optionally canceling it (DELETE), for tasks that can be canceled. The advantages of this approach:

  • From the worker's standpoint, the service is a black box and just sends and receives bytes. It is assumed the task "queuer" knows what to expect/do with the response data once it retrieves it from the result store.
  • Because of the above point, a service can be anything, such as an external service, an intense computation, or a database query execution
  • The queue processing pipeline has zero dependency on the application or Harvest. That sounds a bit odd, but it means that it can be iterated on independently and optimized to suit its role.

What this means in practice is that the actual database execution is being done via this service interaction, for example:

POST /psql/execute/<uuid>/
{
    "statement": "...",
    "parameters": [...]
}

Obviously this would be bare bones, assuming the connection/credentials are already known by the service.

The idea is that Serrano would define a "service-compatible" endpoint for executing the query and it would be a consumer of itself. It would queue a task to send a request to itself and the response would be sent to the result store.

As mentioned above, the service would need to support GET/POST to "execute the task", e.g. /run-task/<uuid>/, and optionally support DELETE (to the same URL) to cancel the task.
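
From the queue worker's side, interacting with such a service could look roughly like the following; the requests library, base URL, and response handling are assumptions for illustration:

import requests

SERVICE = 'http://localhost:5000'

def execute_task(uuid, statement, parameters):
    # POST runs the task; the response body is the raw result which
    # the queuer then places in the result store.
    resp = requests.post('%s/%s/' % (SERVICE, uuid), json={
        'statement': statement,
        'parameters': parameters,
    })
    resp.raise_for_status()
    return resp.content

def cancel_task(uuid):
    # DELETE to the same URL cancels the task if the service supports it.
    resp = requests.delete('%s/%s/' % (SERVICE, uuid))
    return resp.ok
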

@bruth
Contributor Author

bruth commented Sep 25, 2014

@naegelyd I thought about that too. It could be a task option that prevents it from being purged, such as with query results that may be used across sessions. Of course this only works as long as the results are valid, otherwise the data would have to be purged like the application cache.

@bruth
Contributor Author

bruth commented Sep 25, 2014

@murphyke To give a concrete example, I envision a barebones service to look like this (assume this is a single process and thread-safe):

import json

import psycopg2
from flask import Flask, abort, request

app = Flask(__name__)

# Map of task UUIDs to their open connections so a running query
# can be canceled from another request.
tasks = {}

@app.route('/<uuid>/', methods=['POST'])
def run(uuid):
    conn = psycopg2.connect(...)
    tasks[uuid] = conn

    data = request.json

    try:
        c = conn.cursor()
        c.execute(data['statement'], data['parameters'])
        # Serialize the rows as a JSON array (see the note below).
        return json.dumps(c.fetchall(), default=str)
    except Exception:
        conn.cancel()
        abort(500)
    finally:
        # The entry may already be gone if a cancel request removed it.
        tasks.pop(uuid, None)
        conn.close()

@app.route('/<uuid>/', methods=['DELETE'])
def cancel(uuid):
    if uuid not in tasks:
        abort(404)

    # Interrupt the statement running on the stored connection.
    tasks.pop(uuid).cancel()
    return '', 204

@bruth
Contributor Author

bruth commented Sep 25, 2014

Also assume the data being returned is serialized into something such as a JSON array, or pickled if Python will be the only consumer.

@murphyke
Member

@bruth Yeah, I saw Redis and workers in the diagram and thought you might be thinking of layering on top of RQ (or other existing task queuing system). So does this have to be part of Serrano at all? It sounds like a generic task queuing system with a REST API. Is there an existing component out there that we could use?

@bruth
Contributor Author

bruth commented Sep 25, 2014

That is what I am investigating now. At a minimum Serrano would merely be a light wrapper to make it easier to interact with the tasks Harvest is queuing and consuming from Cilantro's perspective.

@bruth
Contributor Author

bruth commented Sep 25, 2014

@murphyke Well, the task queue doesn't have the REST API; it uses HTTP as the protocol for communicating with services.

@bruth
Contributor Author

bruth commented Sep 25, 2014

@murphyke I just realized what you meant re: it being a REST interface from Avocado's standpoint. It basically needs to be an HTTP proxy + named queues + response storage.
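
A minimal sketch of that combination, assuming Redis backs both the named queues and the response storage (the key names, task fields, and TTL are hypothetical):

import json

import redis
import requests

r = redis.StrictRedis()

def work(queue='default', result_ttl=3600):
    # Pop tasks off a named queue, proxy the HTTP request each one
    # describes, and store the response under the task's uuid.
    while True:
        _, raw = r.blpop('queue:' + queue)   # blocks until a task arrives
        task = json.loads(raw)

        resp = requests.request(
            task.get('method', 'POST'),
            task['url'],
            data=task.get('body'),
        )

        # Responses expire so the interim storage does not balloon.
        r.setex('result:' + task['uuid'], result_ttl, resp.content)
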

@tjrivera

I'm going to attempt some sort of implementation of HTQ in the Harvest stack, with the CBTTC Harvest application as a model.

@bruth
Contributor Author

bruth commented Jan 12, 2015

I am working on the query canceling part: #280

@bruth
Contributor Author

bruth commented Jan 28, 2015

I just made an observation that canceling an export after the formatter processing has begun does not stop the processing. This is another reason why queries need to be executed in a separate thread or process so they can be managed.

@bruth bruth assigned naegelyd and unassigned bruth Feb 8, 2015
@bruth
Contributor Author

bruth commented Feb 8, 2015

It has been many months since the original inception and description of this issue, so I want to clarify and restate the intent of this issue.

The outcome of this feature is to provide more control over queries being executed and the results they produce. Control is defined conceptually as:

  • Visibility of the queries currently running
  • Management of when queries execute, in order to control system resources or throttle resources per user
  • Some notion of persistence of a query and its results for later reference

The motivation for this feature is the ability for users to execute a query and access the results at a later time. In Cilantro, this could be implemented in two contexts:

  • Rather than clicking "View Results" which would immediately execute the query, one could (hypothetically) click "Run For Later"
  • The user clicks "View Results", but realizes or is told (Add DDL Analysis Subcommand and Facilities #296) the query may take a while and wants to "background" the query so they can move on

In either case, the requirements are the same:

  • A means of persisting a "planned" query and keeping track of its state
  • Storing the results of a query so they can be referenced at a later time

We have persistent queries in the form of a saved DataQuery tied to a user. This may or may not be appropriate for this feature, but it's a starting point for thinking about what may need to be added to the API.

The current thought for storing the results is to simply cache them (#109) using the current cache APIs. This is likely to be good enough until there is more evidence of how this feature is being used.
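
As an illustration of that, a sketch using Django's low-level cache API (the key scheme and timeout are placeholders):

from django.core.cache import cache

RESULT_TIMEOUT = 60 * 60 * 24  # expire cached results after a day

def store_results(query_id, rows):
    # Cache the query output under a predictable key so it can be
    # looked up later by the query/job id.
    cache.set('query-results:%s' % query_id, rows, RESULT_TIMEOUT)

def get_results(query_id):
    # Returns None if the results expired or were never stored.
    return cache.get('query-results:%s' % query_id)
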

With these base features already in place, this issue is distilled down to keeping track of all planned, running, and cached ad-hoc queries. The API should support listing these queries, inspecting their state, retrieving their results, and canceling a running query.

Another way to think about it is that an administrative page of the running queries and their properties can be displayed, with the ability to click through to the results or cancel a query if it is running.
