ocrd network: defer enqueue until workspace is free #1046
Comments
I think this approach is better and more direct than creating a unique …
Sounds good. But how will the Processing Server know which processing requests are chained together in a workflow? Since there is already a differentiation between …
The latter is not necessary AFAICS. The Processing Server must simply assume that all requests pertaining to the same workspace have a dependency relation and therefore enqueue them in the order they were received (regardless of where they come from – the Workflow Server in a single workflow, or multiple workflows, or independent, single-step requests to the Processing Server).
Yes, but the Processing Server is just a proxy (with some validation on top) as it is now and has no other role. I am not sure how the buffering of processing requests should happen without some sort of caching mechanism inside the Processing Server or the Database itself to enqueue all the requests of a specific Workspace.
The PS is the only one creating and enqueuing the processing requests. That's where you can control things. You don't enqueue to the "public" side (where workers can consume) more than one job per workspace at a time. The internal queue thus enforces order. See my explanation above.
I was just thinking about a solution without utilizing any internal network queues at all (avoiding additional connections to the RabbitMQ server). The Processing Server will be the only one pushing and pulling from these queues anyway. So maybe utilizing a Python list of …

Once the Processing Worker/Processor Server is done with the execution, it can use a callback method (if provided in the request) … This of course removes the need for the Workflow Server to poll the processing request statuses. A callback method can be invoked by the Processing Server to the Workflow Server to indicate that all processing requests for a …

However, an implementation should not come without considering error handling and potential problems as well. Maybe I am missing something, but isn't the Nextflow script becoming unneeded then? We can pretty much use plain …
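To make the in-memory buffering idea above more concrete, here is a minimal sketch; the class and method names are made up for illustration (this is not the actual ocrd_network code), and thread-safety/asyncio concerns are left out. At most one request per workspace is forwarded to the worker queues at a time; the rest wait in arrival order until the worker's callback reports completion.

```python
from collections import deque
from typing import Callable, Deque, Dict


class WorkspaceRequestCache:
    """Buffer processing requests per workspace inside the Processing Server (illustrative only)."""

    def __init__(self, forward: Callable[[dict], None]):
        # `forward` stands for whatever actually pushes a message onto the
        # worker/processor-server queue; it is an assumption, not a real API.
        self.forward = forward
        self.pending: Dict[str, Deque[dict]] = {}  # workspace_id -> waiting requests (FIFO)
        self.busy: Dict[str, bool] = {}            # workspace_id -> is a job currently out?

    def submit(self, workspace_id: str, request: dict) -> None:
        """Called for every incoming processing request."""
        if self.busy.get(workspace_id):
            # Workspace is busy: keep the request, preserving arrival order.
            self.pending.setdefault(workspace_id, deque()).append(request)
        else:
            self.busy[workspace_id] = True
            self.forward(request)

    def on_job_finished(self, workspace_id: str) -> None:
        """Called from the result callback of the Processing Worker/Processor Server."""
        queue = self.pending.get(workspace_id)
        if queue:
            # Release the next buffered request for this workspace.
            self.forward(queue.popleft())
        else:
            self.busy[workspace_id] = False
```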
Indeed. Much simpler!
Yes, exactly.
I think that would make life easier, actually. We used to skip this level, because the client could directly talk to the queue. But this is a better separation of concerns IMO. But I have not revisited the error handling concept specifically yet.
We would still need Nextflow to parse the workflow syntax and slice it up into single processing calls with all the parameters involved. Also, Nextflow can still help us with error handling and re-processing IIRC.
Yes, but we should not forget that invoking some endpoint of the Workflow Server to register errors from the Processing Server cannot affect the runtime execution of a Nextflow workflow. That's the reason why polling processing requests is needed inside the script itself. This allows workflow branching during runtime as well.
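For illustration, the polling inside the workflow script could boil down to something like the following; the exact status endpoint and state names are assumptions here, not the confirmed API.

```python
import time

import requests


def wait_for_job(server_url: str, processor: str, job_id: str, interval: float = 5.0) -> str:
    """Poll the Processing Server until the job leaves the QUEUED/RUNNING states.

    Assumes a status endpoint of the form GET {server_url}/processor/{processor}/{job_id}
    returning JSON with a "state" field; adjust to the real API as needed.
    """
    while True:
        response = requests.get(f"{server_url}/processor/{processor}/{job_id}")
        response.raise_for_status()
        state = response.json().get("state")
        if state not in ("QUEUED", "RUNNING"):
            # e.g. SUCCESS or FAILED - this is where the workflow can branch.
            return state
        time.sleep(interval)
```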
Sorry, have not immersed myself into our chain of thought again: Would that by any chance be something that the METS Server can help us with?
IMO, not really. There is still no clear vision of the METS Server's place in the network architecture, as partially discussed in #1035. However, from what I remember, the METS server should be started, used, and stopped by the workers/processor servers that actually run the OCR-D processors on the specific workspace. So, I would rather keep that as a low-level detail that workers know about but the processing server does not. The processing server should just buffer the incoming requests of a specific workspace internally and pass the messages into the designated worker/processor server queue. Of course, with that approach, it is a bit inefficient to start/stop the METS server for each processing message. However, I guess a bit of inefficiency would be preferable to having a more complex design which may require more error handling and debugging in the future.
Yes, but IMO we have now come to the point where we need to mold that vision into shape. If we do not want to use the METS Server as a synchronisation mechanism for workspaces, we at least have to have a good reason for doing so. I currently cannot see any.
But IIRC efficiency (avoiding unnecessary METS de/serialisation between workflow steps or even between pages) was the very reason (besides synchronisation for page parallelisation) we conceived of the METS Server in the first place. (I cannot find a written discussion, I think it was @kba's idea, but here is an early mention.) Put another way: what is the point of spinning up a METS Server for each processing step and tearing it down afterwards? – We could just initially parse and finally generate the METS file instead.

Also, when you first outlined the error handling architecture, you placed that task at the Processing Server (not the Workers) yourself. I still see it that way. It could be as simple as "start if it does not already run, stop if not needed for more than 5 min".

If all METS operations could be delegated to the METS Server (because it is guaranteed to run), then they would automatically get synchronised (whether in a page-parallel processor or in a multi-processor workflow) and get serialised as late as possible. The latter could be implicit (when the Processing Server is done processing a single client-side request without METS Server parameter), or explicit (when a Workflow Server requires it at the end of a workflow, or when the Workspace Server is requested to provide results).
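A rough sketch of such a "start if not running, stop when idle" policy; the class is purely illustrative, and the actual spawning/stopping of the METS Server process is left as placeholders.

```python
import time
from typing import Dict


class MetsServerPool:
    """Keep one METS Server per workspace alive while it is being used (illustrative only)."""

    def __init__(self, idle_timeout: float = 300.0):
        # 300 s corresponds to the "not needed for more than 5 min" rule of thumb above.
        self.idle_timeout = idle_timeout
        self.addresses: Dict[str, str] = {}    # workspace_id -> METS Server address
        self.last_used: Dict[str, float] = {}  # workspace_id -> last access time

    def get_or_start(self, workspace_id: str) -> str:
        """Return the METS Server address for a workspace, starting one if needed."""
        if workspace_id not in self.addresses:
            # Placeholder: a real implementation would spawn the METS Server
            # process here; we merely record an assumed socket path.
            self.addresses[workspace_id] = f"/tmp/mets-server-{workspace_id}.sock"
        self.last_used[workspace_id] = time.time()
        return self.addresses[workspace_id]

    def reap_idle(self) -> None:
        """Called periodically (e.g. from a background task) to stop idle servers."""
        now = time.time()
        for workspace_id, last in list(self.last_used.items()):
            if now - last > self.idle_timeout:
                # Placeholder: a real implementation would shut the server down,
                # letting it serialise the METS file before exiting.
                del self.addresses[workspace_id]
                del self.last_used[workspace_id]
```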
Agree.
To make it clear: I am not against the METS Server for synchronization. I was rather discussing the responsibilities for the METS Server – which module starts/stops it.
For internal page-wise processing on the worker side?
Right, but back then the scenario of placing workers on different hosts was not considered. Okay, we can stick to that choice. In that case, the Processing Server will spawn METS servers and provide the respective address in the processing request to the workers/processor servers.
I am almost sure that defining the optimal time for stopping the server would not be an easy task and would be error-prone. I would rather not rely on time to decide when to stop the METS server. Since the workers utilize a callback mechanism, the METS server should be shut down when the last processing step is finished.
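The callback-driven alternative could be as simple as reference counting, sketched here with made-up names; the actual shutdown routine is passed in, since nothing in this thread defines it.

```python
from collections import Counter
from typing import Callable


class MetsServerRefCount:
    """Stop the METS Server once the last outstanding job of a workspace reports back (illustrative)."""

    def __init__(self, stop_server: Callable[[str], None]):
        self.stop_server = stop_server  # whatever actually shuts the METS Server down
        self.open_jobs = Counter()      # workspace_id -> jobs enqueued but not yet finished

    def job_enqueued(self, workspace_id: str) -> None:
        self.open_jobs[workspace_id] += 1

    def job_finished(self, workspace_id: str) -> None:
        """Invoked from the worker's/processor server's result callback."""
        self.open_jobs[workspace_id] -= 1
        if self.open_jobs[workspace_id] <= 0:
            del self.open_jobs[workspace_id]
            self.stop_server(workspace_id)
```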
Oh, of course, for that it still makes sense.
True. But isn't that even more reason to centralise METS Server management? If my interpretation is correct that METS Servers always run where the workspace is (i.e. no data file transfers through the METS Server), then for external Workers the problem would merely be about …
Yes! And in addition, requests to the Processing Server should get an optional METS Server parameter, so its lifetime can be controlled externally.
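For example, the processing request model could carry an optional METS Server address roughly like this (field names are hypothetical, not the actual ocrd_network schema):

```python
from typing import Dict, List, Optional

from pydantic import BaseModel


class ProcessingRequest(BaseModel):
    """Illustrative request body; not the real ocrd_network model."""
    processor_name: str
    path_to_mets: str
    input_file_grps: List[str]
    output_file_grps: List[str]
    parameters: Dict = {}
    # If given, the Processing Server/Worker reuses this already running METS
    # Server instead of managing one itself, so its lifetime is controlled externally.
    mets_server_url: Optional[str] = None
```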
Indeed, for callback listeners, the write-back must be as early as possible anyway.
Yes, I am closing it. I will now go through all other …
In the Processing Server, we currently add jobs to the queue unconditionally, without checking whether any job is already running on the respective workspace:
core/ocrd_network/ocrd_network/processing_server.py, line 281 (commit b713030)
Obviously, this will create inconsistent (or at least surprising) results when another request is made for the same workspace while a job is still running, or if two requests are made for the same workspace before either of them is scheduled.
IMO (since everything goes through it) we can simply look up the database for effective locking. We could add a method db_has_processing_job(workspace_id) which checks whether there are any QUEUED or RUNNING jobs for that workspace.

But what happens then? Ideally, the Processing Server would itself "wait" for the workspace to become available before actually enqueuing the new job. And that of course must be atomic. And accommodate our asynchronous model.
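A minimal sketch of that lookup, assuming a MongoDB collection of job documents with workspace_id and state fields; the real ocrd_network database layer may look different.

```python
from pymongo import MongoClient

# Assumed connection and collection names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
jobs = client["ocrd"]["processing_jobs"]


def db_has_processing_job(workspace_id: str) -> bool:
    """True if any job for this workspace is still QUEUED or RUNNING."""
    return jobs.count_documents(
        {"workspace_id": workspace_id, "state": {"$in": ["QUEUED", "RUNNING"]}},
        limit=1,
    ) > 0
```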
One idea would be to enqueue the job in a "hidden" queue, which only the Processing Server itself subscribes to and which can be identified via the workspace ID (e.g. queue_name = "workspace." + data.workspace_id). And when the Processing Worker is done, it sends the OcrdResultMessage not only to the OcrdProcessingMessage's result queue and callback URL, but to a constant, unique result queue as well. That queue in turn gets subscribed to by the Processing Server, in a background thread independent of the requests. When a result message is consumed on it, the Processing Server checks if there is any "hidden" OcrdProcessingMessage for the same workspace ID, and if so, consumes it and enqueues it on the actual processor queue.

(But I think that would still not be atomic. Perhaps instead of the result queue mechanism and background thread, we should use the callback mechanism and add a regular endpoint – like a DELETE to /processor/{processor_name}/{job_id} – to the Processing Server.)

What I find intriguing is the perspective that with the additional machinery, we could avoid the need to poll the job status in the Workflow Server: the latter could simply post/push all step jobs for that workspace right away, and rely on the Processing Server to ensure they are executed in order. (One still needs conditionality on success status, but that's also currently needed in the Workflow Server.)
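Sketched with pika (assuming a local RabbitMQ broker; queue names and helper functions are illustrative only), the hidden-queue mechanics could look roughly like this:

```python
import pika

# Assumed broker location; in practice this would come from the Processing Server config.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()


def defer_job(workspace_id: str, message: bytes) -> None:
    """Park a processing message in a queue only the Processing Server itself consumes."""
    queue_name = "workspace." + workspace_id
    channel.queue_declare(queue=queue_name, durable=True)
    channel.basic_publish(exchange="", routing_key=queue_name, body=message)


def release_next_job(workspace_id: str, processor_queue: str) -> None:
    """On a result message (or DELETE callback) for this workspace, move the next
    hidden message, if any, onto the actual processor queue."""
    queue_name = "workspace." + workspace_id
    method, _properties, body = channel.basic_get(queue=queue_name, auto_ack=False)
    if method:
        channel.basic_publish(exchange="", routing_key=processor_queue, body=body)
        channel.basic_ack(method.delivery_tag)
```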