Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow re-running #26

Merged
merged 2 commits into from
Jul 1, 2022
Merged

allow re-running #26

merged 2 commits into from
Jul 1, 2022

Conversation

bertsky
Copy link
Member

@bertsky bertsky commented Jun 28, 2022

Fixes slub/ocrd_kitodo#17

Note that due to OCR-D/core#825 this still does not work completely, but there's nothing we can do about that on our side.

@bertsky
Copy link
Member Author

bertsky commented Jun 28, 2022

Also, note that this does not cope with re-entering the same job again (which would also make it re-use the Controller-side REMOTE_DIR), only the same process. (We currently have no entry point for rerunning jobs from Kitodo.)

@markusweigelt
Copy link
Collaborator

@bertsky Should we create a make target to delete files of WORKDIR of Manager or Controller? Or should we add a parameter to force deleting WORKDIRS before processing?

@bertsky
Copy link
Member Author

bertsky commented Jun 30, 2022

Also, note that this does not cope with re-entering the same job again (which would also make it re-use the Controller-side REMOTE_DIR), only the same process.

But thinking about it, it may at least help to name REMOTE_DIR without the variable $PID, only with $PROCESS_ID and $TASK_ID, which are constant and therefore should allow for rerunning the job (regardless of how this might be triggered). What do you think?

@bertsky
Copy link
Member Author

bertsky commented Jun 30, 2022

Should we create a make target to delete files of WORKDIR of Manager or Controller? Or should we add a parameter to force deleting WORKDIRS before processing?

No, I wouldn't do that via makefile. There's a to-do in the comments to cron-schedule the removal on the Controller, which should suffice. And for the Manager, the same mechanism that removes finished process data should also be responsible for the ocr-d/ side, so that's outside of the Manager's scope.

@markusweigelt
Copy link
Collaborator

markusweigelt commented Jun 30, 2022

Also, note that this does not cope with re-entering the same job again (which would also make it re-use the Controller-side REMOTE_DIR), only the same process.

But thinking about it, it may at least help to name REMOTE_DIR without the variable $PID, only with $PROCESS_ID and $TASK_ID, which are constant and therefore should allow for rerunning the job (regardless of how this might be triggered). What do you think?

I think the $TASK_ID is not suitable to put this value in the scope of the Controller, cause it is a Kitodo.Production specific and Controller should be independent from application who triggers ocr process. I think it should be the $PROCESS_ID with prefix of application e.g. "Production", "Presentation" hand over by the Manager or something else to distinguish between scripts e.g. for_production, for_presentation ...

@bertsky
Copy link
Member Author

bertsky commented Jun 30, 2022

I think the $TASK_ID is not suitable to put this value in the scope of the Controller, cause it is a Kitodo.Production specific and Controller should be independent from application who triggers ocr process. I think it should be the $PROCESS_ID with prefix of application e.g. "Production", "Presentation" hand over by the Manager or something else to distinguish between scripts e.g. for_production, for_presentation ...

It's not in the "scope" of the Controller, though. It's the Manager's choice. The Controller just gets a path name (ideally not conflicting with anything else). And I think that having TASK_ID in there is actually correct: Suppose you have a workflow with two places for OCR(-D): once for the page layout and text on the images, then some more steps in Production including export, and then again for document layout on the presentation METS. The second time must not clash with the first time, i.e. it should have different directories on the Controller.

@markusweigelt
Copy link
Collaborator

Ok that is a good point. I have to think about this a bit more because tasks can be deleted etc. but for the moment it sounds like the best way.

@@ -58,7 +58,7 @@ init() {
logger -p user.error -t $TASK "insufficient permissions on /data volume"
exit 5
fi
REMOTEDIR="KitodoJob${PID}_$(basename $PROCESS_DIR)"
REMOTEDIR="KitodoJob_${PROCESS_ID}_${TASK_ID}_$(basename $PROCESS_DIR)"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest rename to KitodoProductionJob to distinguish between "Kitodo.Presentation" later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? The directory name should not be used for provenance. If such information is needed, we should rather annotate directly in the METS.

(Also, I'd rather like to reduce Production-specific settings in ocrd_lib.sh...)

@markusweigelt markusweigelt merged commit 64f2780 into slub:main Jul 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

allow re-runs
2 participants