Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

METS Server #966

Merged
merged 50 commits into from
Aug 22, 2023
Merged

METS Server #966

merged 50 commits into from
Aug 22, 2023

Conversation

kba
Copy link
Member

@kba kba commented Dec 9, 2022

PR is for early feedback, this is a proof-of-concept, not yet ready for wider use because it is not tested systematically yet. (it is tested and largely consolidated now 🤞)

METS server can be started with:

# Start server listening to a UNIX domain socket
ocrd workspace --mets-server-url /tmp/ws.sock -d /path/to/workspace server start
# Start server listening to TCP port
ocrd workspace --mets-server-url http://localhost:8123 -d /path/to/workspace server start

Note: If you want to use the TCP interface, you must prefix the --mets-server-url with http://, otherwise it will be interpreted as a UDS.

Then you can do things like OcrdMets.find_files via http:

http get localhost:8123 file_grp=OCR-D-IMG "mimetype=//image/.*"

The same can be achieved through ocrd workspace:

ocrd workspace -U /tmp/ws.sock find -G OCR-D-IMG -m "//image/.*"

And processors also accept -U/--mets-server-url:

ocrd-tesserocr-recognize -U /tmp/ws.sock -I OCR-D-IMG-BIN -O TESS

Processors do still need to specify the workspace directory because they read files from disk. Should we go further and provide a way to download files from the METS server? We should not.

The implementation (in ocrd.mets_server) works like this:

  • OcrdMetsServer is the server component. It is provided with a workspace and calling its startup method creates a fastapi app that is run with uvicorn.
  • ClientSideOcrdMets is a replacement for OcrdMets that delegates to the OcrdMetsServer instead of the lxml DOM.
  • ClientSideOcrdFile is the equivalent replacement for OcrdFile
  • ClientSideOcrdAgent is the equivalent replacement for OcrdAgent
  • Messages between server and client are encoded as pydantic models

OcrdWorkspace, ocrd_decorators etc. accept mets_server_url in addition to mets_url, mets_basename etc. If --mets-server-url is specified, the OcrdWorkspace.mets is instantiated as a ClientSideOcrdMets instead of OcrdMets. The idea behind this, is that the change in behavior should be largely transparent, there are very few changes to OcrdWorkspace itself, only how the METS is accessed is different, but processor developers and users of ocrd workspace should not need to change anything to get support for METS server.

As I said, this is a proof-of-concept at this stage, there some hacky solutions in it and I haven't yet started to unit/integration test this properly. There is now a basic test.

I would appreciate feedback on

  • the fastapi implementation this is good enough to move forward I think
  • whether downloading from the server should be A Thing It should not
  • whether we need both TCP and UNIX socket Yes, adds little complexity but offers flexibility
  • how best to test this Testing with a pytest fixture where the server is run via socket in a separate process, parallel additions tested with process pools

@kba kba requested review from MehmedGIT and joschrew December 9, 2022 12:57
Copy link
Contributor

@MehmedGIT MehmedGIT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the initial portion of my feedback by just peeking over the code. More will come later this week.

ocrd/ocrd/workspace.py Show resolved Hide resolved
ocrd/ocrd/mets_server.py Outdated Show resolved Hide resolved
ocrd/ocrd/mets_server.py Outdated Show resolved Hide resolved
ocrd/ocrd/mets_server.py Show resolved Hide resolved
ocrd/ocrd/mets_server.py Outdated Show resolved Hide resolved
ocrd/ocrd/resolver.py Outdated Show resolved Hide resolved
Copy link
Contributor

@joschrew joschrew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, here is my review. I read through all the code and briefly tested the new feature.

This review-comment has 3 parts: questions, uds, tests. Feedback to fastapi are included in my code-comments

Questions:

  1. Maybe we/you could point out the idea of the Mets-Server again (then it is better for me to recap): IIUC the idea is that all interactions with a workspace are done by the Mets-Server. So there is just one process (the server) accessing the Metsfile. What I am wondering is: when a user initiates multiple calls to the server how is it ensured that the Metsfile is accessed in a "synchronized way". Can't it be for example that the Met-Server itself receives 2 requests to add a file to the Metsfile and this calls "happen at the same time" and thus the writes collide somehow? Or is this related somehow to the Metsfile caching? Or because python only has one thread?

Processors do still need to specify the workspace directory because they read files from disk

I don't understand what you mean by this. Does it mean in addition to host/port the path to the workspace has to be specified? If so why, as far as I understand the Mets-Servers is the only one who needs to know where the Metsfile is stored.

whether downloading from the server should be A Thing

sry, I don't understand that either, can you try to clarify? Do you mean we should thing about if we should offer the functionality to download the Metsfile and the filegroups and everything?

Unix sockets

I don't have much experience with unix sockets. What is the benefit in using sockets compared to using tcp? I am wondering if it's worth the added "complexity" because both (uds and tcp) offer nearly the same thing (iiuc). What I would do is go with tcp and drop the sockets at least for the beginning (although it seems to be nicely usable with uvicorn/fastapi).

Tests

I didn't think very much about the tests yet. I am wondering if it is time for that yet, I know about TDD but I think I am not experienced/used to do that.
But in the webapi-implementation we already have some tests regarding fastapi. There we also use pytest-docker because we need mongodb (and later probably rabbit-mq) as well. I am not sure weather we need this also for the metsserver yet. Currently I don't think so.
Testing api endpoints with fastapi is in my opinion simple (using docker is a bit more complicated from my point of view). First a fixture (not mandatory but I think a fixture is the best way) for the fastapi-Instance/TestClient is needed: https://github.com/OCR-D/ocrd-webapi-implementation/blob/main/tests/conftest.py#L33-L36 (let's pretend as if in the example the function has no parameter) and then it can be used to execute http methods: https://github.com/OCR-D/ocrd-webapi-implementation/blob/main/tests/conftest.py#L176-L177. The rest (starting the server etc) is done by fastapi in the background.

ocrd/ocrd/workspace.py Show resolved Hide resolved
ocrd/ocrd/mets_server.py Show resolved Hide resolved
ocrd/ocrd/mets_server.py Show resolved Hide resolved
ocrd/ocrd/mets_server.py Outdated Show resolved Hide resolved
ocrd/ocrd/mets_server.py Show resolved Hide resolved
ocrd/ocrd/mets_server.py Show resolved Hide resolved
ocrd/ocrd/mets_server.py Outdated Show resolved Hide resolved
@MehmedGIT
Copy link
Contributor

MehmedGIT commented Dec 16, 2022

Hi Jonas. I will try to provide some insight on the questions you asked as far as I understand, Konstantin can then modify/extend the answers. In the end, hopefully, all of us will be on the same page.

  1. Maybe we/you could point out the idea of the Mets-Server again (then it is better for me to recap): IIUC the idea is that all interactions with a workspace are done by the Mets-Server. So there is just one process (the server) accessing the mets file.

That's correct.

What I am wondering is: when a user initiates multiple calls to the server how is it ensured that the Metsfile is accessed in a "synchronized way". Can't it be for example that the Met-Server itself receives 2 requests to add a file to the Metsfile and this calls "happen at the same time" and thus the writes collide somehow? Or is this related somehow to the Metsfile caching? Or because python only has one thread?

Good question. Still not completely clear to me as well. On a low level, how does the Session from requests_unixsocket work? I did not check that yet.
The plan is to store the changes in the server's RAM and save the content of the mets file on the disk only when requested. Or store it every few seconds depending on how it is configured (to reduce the amount of loss of data if a crash happens). If we extend the mets caching (currently, we don't cache everything), we may also just store the compact cache on the disk for recovery purposes. I assume the requests are accepted in async way, but we do synchronize writes to the XML tree/cache simply by locking/unlocking the XML tree/cache.

Processors do still need to specify the workspace directory because they read files from disk

I don't understand what you mean by this. Does it mean in addition to host/port the path to the workspace has to be specified? If so why, as far as I understand the Mets-Servers only needs to know where the Metsfile is stored.

We have two different options for the mets server:

  1. UDS (socket file) - Consider the provided example:
    ocrd-tesserocr-recognize --socket /tmp/ws.sock -I OCR-D-IMG-BIN -O TESS
    Here the processor knows where to write, i.e., the socket file (stream-oriented). But the processor also needs to know where to read from the fileGrp OCR-D-IMG-BIN files. So, to the example above the path of the mets file has to be added. This should be already obvious - works only when the mets server and the ocrd processor are running on the same host.

  2. TCP (host/port) - In this case, the processors still need to know the workspace_id (the webapi-impl terminology) from where they need to read. They don't know how/where the data is stored inside the server.

whether downloading from the server should be A Thing

sry, I don't understand that either, can you try to clarify? Do you mean we should thing about if we should offer the functionality to download the Metsfile and the filegroups and everything?

I guess, yes - I have the same understanding. Whether the mets server should download missing files. Although I think this does not include downloading the mets file, just whatever is referenced inside the mets file.

Unix sockets
I don't have much experience with unix sockets. What is the benefit in using sockets compared to using tcp? I am wondering if it's worth the added "complexity" because both (uds and tcp) offer nearly the same thing (iiuc).

In short - the Unix socket is used for interprocess communication (IPC) and is an alternative to native IPC (e.g., pipes and fifos) . In cases where the mets server and the ocrd processor/s are running on the same host, the communication happens in a more efficient way (check here and here). The data transferred over the socket can also be further compressed - e.g., using a message scheme instead of relying on the slower JSON schema (EDIT: message compression is possible over TCP as well, but usually not preferred if users interact with the server directly and they need meaningful responses, JSON is ideal in that case). The mets server will be accessed by ocrd processors and not by users directly. The user is only interested in the latest mets file state once the processing is finished.

Another HPC-specific advantage is that it completely removes any possibility of potential port clashes with other running services. When I asked a more experienced person about starting a mets server inside the HPC, I've been told to always use ports above 10000 to reduce the possibility of potential port clashes with internal services. However, this is still not safe enough in the long run since it can still potentially clash with another user service. The more mets server instances inside the HPC, the more probability of a clash.

What I would do is go with tcp and drop the sockets at least for the beginning (although it seems to be nicely usable with uvicorn/fastapi).

I would not prefer to drop the sockets. However, I think we should separate the host/port from the socket concept a bit, so it is more obvious to the general user what are these. Currently, they are kind of under the same hood. Even a slightly better help description on the CLI may help here.

ocrd/ocrd/mets_server.py Outdated Show resolved Hide resolved
@kba
Copy link
Member Author

kba commented Dec 19, 2022

What I am wondering is: when a user initiates multiple calls to the server how is it ensured that the Metsfile is accessed in a "synchronized way". Can't it be for example that the Met-Server itself receives 2 requests to add a file to the Metsfile and this calls "happen at the same time" and thus the writes collide somehow? Or is this related somehow to the Metsfile caching? Or because python only has one thread?

Good question. Still not completely clear to me as well. On a low level, how does the Session from requests_unixsocket work? I did not check that yet.

The METS server is single-threaded, so any concurrent requests will work on the same in-memory OcrdWorkspace/OcrdMets object. This behavior is unrelated to caching, though caching will speed up file search here as well of course.

whether downloading from the server should be A Thing

sry, I don't understand that either, can you try to clarify? Do you mean we should thing about if we should offer the functionality to download the Metsfile and the filegroups and everything?

I guess, yes - I have the same understanding. Whether the mets server should download missing files. Although I think this does not include downloading the mets file, just whatever is referenced inside the mets file.

The way the METS server is currently implemented still requires processors to be run with a local copy of the workspace, only file adding and searching (and a few related methods) are implemented, access to the actual files happens via filesystem. I was wondering whether we should decouple this further, so that a processor can request a file and download it via the METS server, store it in a local scratch directory (or just keep in memory if feasible), do the processing and upload the file again. That way there would be no need to have a processor-local full workspace at all.

The other interpretation does also make sense though, i.e. downloading misssing files to the server-local workspace, like we do with the --download flag to ocrd workspace find. Perhaps we should add a query parameter download=1 to GET /file.

What I would do is go with tcp and drop the sockets at least for the beginning (although it seems to be nicely usable with uvicorn/fastapi).

I would not prefer to drop the sockets. However, I think we should separate the host/port from the socket concept a bit, so it is more obvious to the general user what are these. Currently, they are kind of under the same hood. Even a slightly better help description on the CLI may help here.

I would also prefer to keep both options for the reasons @MehmedGIT explained so eruditely above. If you have suggestions on how to better describe the features, I'll be happy to integrate them.

@MehmedGIT
Copy link
Contributor

MehmedGIT commented Dec 19, 2022

The way the METS server is currently implemented still requires processors to be run with a local copy of the workspace, only file adding and searching (and a few related methods) are implemented, access to the actual files happens via filesystem.

Since workspace, depending on the context, could mean both a mets file and images, I think we should stick to these separate terms when referring to workspaces. At least for me, it will help. So, reads/writes on the mets file happen through the Mets server, but ocrd processors still have the images stored locally. Although there is a mets file available locally as well, it is not used by the processors.

I was wondering whether we should decouple this further, so that a processor can request a file and download it via the METS server, store it in a local scratch directory (or just keep in memory if feasible), do the processing and upload the file again.

This sounds reasonable considering that the processors should know nothing regarding the workspace structure but just process images. However, we should think how to optimize that to avoid unnecessary transfers of images. E.g., uploading images from Local/Network FS to the METS server, then sending them to the processors via the METS server.

Maybe we should first try to identify a spot where the METS Server fits the best in this architecture. I think the METS Server fits the best between the Processing Servers (workers) and the Network File System. It acts as a proxy when a path to the image has to be provided or an image to be downloaded. So the Mets server will not have to store workspace images locally. Since I did not put much thought into this idea yet, maybe there are things I have not considered.

Copy link
Contributor

@MehmedGIT MehmedGIT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. We already discussed some things in person but for the record here is a bug that is still there.

  1. Starting the METS Server: ocrd workspace -U http://localhost:8123 -d /home/mm/Desktop/example_ws2/data server start
  2. Then trying to run some processor: ocrd-cis-ocropy-binarize -U http://localhost:8123 -I DEFAULT -O OCR_BIN

Output:

17:29:31.758 CRITICAL root - initLogging was called multiple times. Source of latest call:
17:29:31.758 CRITICAL root -   File "/home/mm/Desktop/ocrd_core/ocrd-core/ocrd/ocrd/decorators/__init__.py", line 60, in ocrd_cli_wrap_processor
17:29:31.759 CRITICAL root -     initLogging()
Traceback (most recent call last):
  File "/home/mm/venv37-ocrd/bin/ocrd-cis-ocropy-binarize", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_binarize())
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/cli.py", line 18, in ocrd_cis_ocropy_binarize
    return ocrd_cli_wrap_processor(OcropyBinarize, *args, **kwargs)
  File "/home/mm/Desktop/ocrd_core/ocrd-core/ocrd/ocrd/decorators/__init__.py", line 80, in ocrd_cli_wrap_processor
    workspace = resolver.workspace_from_url(mets, working_dir)
  File "/home/mm/Desktop/ocrd_core/ocrd-core/ocrd/ocrd/resolver.py", line 175, in workspace_from_url
    self.download_to_directory(dst_dir, mets_url, basename=mets_basename, if_exists='overwrite' if clobber_mets else 'skip')
  File "/home/mm/Desktop/ocrd_core/ocrd-core/ocrd/ocrd/resolver.py", line 82, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: /home/mm/Desktop/ocrd_core/mets.xml

Seems it still tries to find the mets.xml file in the default path.

I have tried to provide the correct mets path with --mets, however, then the produced output files are not reflected by the METS Server at all.

@kba
Copy link
Member Author

kba commented Aug 21, 2023

EDIT: -d is for ocrd workspace, -w is for processors to specify workspace directory.

Overall looks good. We already discussed some things in person but for the record here is a bug that is still there.

  1. Starting the METS Server: ocrd workspace -U http://localhost:8123 -d /home/mm/Desktop/example_ws2/data server start
  2. Then trying to run some processor: ocrd-cis-ocropy-binarize -U http://localhost:8123 -I DEFAULT -O OCR_BIN

You also need to specify the path to the workspace to the processor, like so:

ocrd-cis-ocropy-binarize -U http://localhost:8123 -I DEFAULT -O OCR_BIN -w /home/mm/Desktop/example_ws2/data

Because writing out files other than mets.xml still happens on the client side.

But there were indeed a few problems in the code, that I have fixed now (mostly that --mets-server-url was not passed properly and I had a typo in the option too 🙄)

I have now tested it with https://content.staatsbibliothek-berlin.de/dc/PPN680203753.mets.xml:

  • ocrd workspace -U http://localhost:8123 -d /tmp/testws server start in one tmux pane
  • for i in seq --format='%02g' 1 10;do ocrd-tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model eng -I DEFAULT -O TESS -w /tmp/testws -U http://localhost:8123 -g PHYS_00$i&;done in the other, i.e. process the first 10 pages in parallel

It takes a moment for the processes to start, but then they go brrrr.

@kba
Copy link
Member Author

kba commented Aug 21, 2023

Tests now fail due to psf/requests#6226 but have to investigate further tomorrow.

@MehmedGIT MehmedGIT self-requested a review August 22, 2023 13:57
Copy link
Contributor

@MehmedGIT MehmedGIT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more issues on my side.

@kba kba merged commit 5ce54a5 into OCR-D:master Aug 22, 2023
@kba kba deleted the mets-server branch August 22, 2023 16:13
@kba kba restored the mets-server branch August 23, 2023 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants