
Collaborator

@bertsky bertsky commented Apr 29, 2025

Since we agreed it is not feasible to continue supporting native installation into a shared venv (with its risk of dependency clashes and the need to find compromises or sidestep them via sub-venvs), and since it will not be needed for the network (WebAPI server-client) installation anyway, this PR implements the first step: keeping backwards-compatible CLI interfaces, but delegating to Docker images throughout. (The second step will be about replacing or complementing the CLI interfaces with a server-client setup, in line with #449.)

  • generate executables by delegating to slim-container Docker images
  • automagically prepare a shared named volume for models with user-friendly permissions and copying pre-installed models
  • create convenient CLIs for ocrd resmgr in each slim image
  • remove unnecessary native installation rules and definitions
  • remove unnecessary fat-container Docker build rules and definitions
  • find a new solution for the ocrd-all-*.json targets
  • ...

@stweil
Collaborator

stweil commented Apr 29, 2025

> we agreed it is not feasible to continue supporting native installation into a shared venv

Did we? Will there be support for another kind of native installation? I have no intention to use a dockerized OCR-D.

@bertsky
Collaborator Author

bertsky commented Apr 29, 2025

> Did we? Will there be support for another kind of native installation? I have no intention to use a dockerized OCR-D.

We talked about this over and over, and repeatedly asked for commentary – esp. in the Tech Call. I kept this alive for a few years with lots and lots of effort, but not only is my time limited – with slim containers, there is no use for this anymore. Container images are much better anyway.

You can still install modules individually from their respective readmes if you want.

@bertsky
Collaborator Author

bertsky commented Apr 30, 2025

To illustrate: running make all (with default options DOCKER_PULL_POLICY=pull DOCKER_VOL_MODELS=ocrd-models DOCKER_RUN_OPTS="-v $(DOCKER_VOL_MODELS):/usr/local/share/ocrd-resources -v $$PWD:/data -u $$UID") will docker pull all images and install a delegator shell script under venv/bin/ocrd-... for each executable. For example, ocrd-tesserocr-recognize will become:

#!/usr/bin/env bash
docker run --rm "${DOCKER_RUN_OPTS[@]}" -v ocrd-models:/usr/local/share/ocrd-resources -v $PWD:/data -u $UID ocrd/tesserocr ocrd-tesserocr-recognize "$@"
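
As a rough illustration of how such delegators could be generated, here is a minimal sketch. The function name make_delegator and its argument order are hypothetical, not taken from the Makefile in this PR. The sketch hard-codes the default volume and mount options, leaves out the "${DOCKER_RUN_OPTS[@]}" pass-through, and quotes $PWD and $UID defensively (the script above shows them unquoted).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): write a delegator script for one
# executable. Usage: make_delegator BINDIR IMAGE EXECUTABLE
make_delegator() {
  local bindir=$1
  local image=$2
  local executable=$3
  mkdir -p "$bindir"
  {
    printf '%s\n' '#!/usr/bin/env bash'
    # The single-quoted format string keeps $PWD, $UID and "$@" literal, so
    # they are expanded when the delegator runs, not when it is generated.
    printf 'docker run --rm -v ocrd-models:/usr/local/share/ocrd-resources -v "$PWD":/data -u "$UID" %s %s "$@"\n' \
      "$image" "$executable"
  } > "$bindir/$executable"
  chmod +x "$bindir/$executable"
}
```

Calling make_delegator venv/bin ocrd/tesserocr ocrd-tesserocr-recognize would then write an executable script of the shape shown above into venv/bin.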

It will then proceed to build ocrd-all-tool.json and ocrd-all-meta.json from the checked-out ocrd-tool.json of every submodule.

With DOCKER_PULL_POLICY=build, make docker is run in each checked-out submodule to rebuild the images locally instead of pulling them from Docker Hub.

To just pull (or build) the images, without (re-)installing the executable shell scripts, do make images.
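
The choice between the two policies can be pictured as a simple dispatch. The following is only a sketch of that idea, not the actual Makefile logic: the function name image_command is hypothetical, and it prints the command instead of running it so it can be tried without Docker. The submodule directory is passed in by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the command that would obtain IMAGE under the
# given policy. SUBMODULE is the checkout directory with a `make docker` rule.
image_command() {
  local policy=$1
  local image=$2
  local submodule=$3
  case $policy in
    pull)  echo "docker pull $image" ;;
    build) echo "make -C $submodule docker" ;;
    *)     echo "unknown DOCKER_PULL_POLICY: $policy" >&2; return 1 ;;
  esac
}
```

For instance, image_command pull ocrd/tesserocr ocrd_tesserocr prints docker pull ocrd/tesserocr, while the build policy prints make -C ocrd_tesserocr docker (the directory name here is just an example argument).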

Finally, to initialise the named volume ocrd-models from the pre-installed processor resources in the images and fix their permissions, just do make init-vol-models once.

To then manage processor resources (list or download), every image now gets an additional delegator shell script that simply wraps the ocrd CLI inside that image. For example, to see what is installed for ocrd/tesserocr, do ocrd-tesserocr-ocrd resmgr list-installed -e ocrd-tesserocr-recognize. To install all registered models, do ocrd-tesserocr-ocrd resmgr download ocrd-tesserocr-recognize "*". (That's exactly what make install-models-tesseract now does.)

(We cannot just use ocrd resmgr for this directly, as that only delegates to the ocrd/core image, which has no other processors installed, so it does not know about any resources.)
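
Put together, a first-time setup for the tesserocr image might look like the following sketch. The commands themselves are the ones named above. The run wrapper, the DRY_RUN variable and the function name first_time_setup are hypothetical additions so that the sequence can be previewed without make or Docker being available.

```shell
#!/usr/bin/env bash
# run: print the command when DRY_RUN=1 (the default here), else execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the steps described in this comment.
first_time_setup() {
  run make all                 # pull images, install delegator scripts
  run make init-vol-models     # one-time initialisation of the ocrd-models volume
  # "*" is quoted so the shell passes it on literally instead of globbing.
  run ocrd-tesserocr-ocrd resmgr download ocrd-tesserocr-recognize "*"
  run ocrd-tesserocr-ocrd resmgr list-installed -e ocrd-tesserocr-recognize
}
```

With the default DRY_RUN=1, first_time_setup only echoes the four commands; setting DRY_RUN=0 would execute them in order.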

Using the processor CLIs is as simple as calling them by name, like in the native installation before, but now these will automatically start the respective container, mount the model volume, mount the current working directory into /data (the internal CWD) and run the processor in there.
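
One consequence of mounting only the current working directory into /data is that files outside it are presumably not visible to the processor. The sketch below makes that mapping explicit. The function to_container_path is a hypothetical illustration, not part of the delegator scripts, and it does no normalisation of .. or symlinks.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: map a host path to the path a containerised
# processor would see, given that CWD is mounted at /data.
# Usage: to_container_path HOST_PATH [CWD]
to_container_path() {
  local host_path=$1
  local cwd=${2:-$PWD}
  case $host_path in
    /*) ;;                                  # already absolute
    *)  host_path=$cwd/$host_path ;;        # relative paths resolve against CWD
  esac
  case $host_path in
    "$cwd")   echo /data ;;
    "$cwd"/*) echo "/data/${host_path#"$cwd"/}" ;;
    *)        echo "not visible in container: $host_path" >&2; return 1 ;;
  esac
}
```

For example, to_container_path OCR-D-IMG/page1.tif prints /data/OCR-D-IMG/page1.tif, whereas an absolute path outside the working directory is reported as not visible.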

What is no longer possible is using multi-processor tools such as ocrd process ... (ocrd just delegates to the ocrd/core image, which has no processors besides ocrd-dummy and ocrd-filter, and you cannot spin up containers from other containers) or ocrd-make (that is just the ocrd/workflow-configuration image, which only has ocrd-page-transform installed internally). One could install core and workflow-configuration natively, so that when ocrd process ... or ocrd-make start other processors, they get to use the delegator scripts...
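
If one went down that route, the natively installed tools would find the processors by name, so the delegator scripts would have to be resolvable on PATH. A small pre-flight check along those lines could look like this. The function name check_processors is hypothetical, and the sketch assumes the directory with the delegators (for example venv/bin) has already been added to PATH.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: report, for each processor name given,
# whether it resolves on PATH. Returns non-zero if any name is missing.
check_processors() {
  local missing=0
  local name
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      echo "ok      $name -> $(command -v "$name")"
    else
      echo "missing $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_processors ocrd-tesserocr-recognize before starting a workflow would show which script the name resolves to, or flag it as missing.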
