Performance when Generating Newspaper Processes #5093

Merged

@thomaslow thomaslow commented Apr 27, 2022

Related and possibly fixed issues:

The problem is caused by the slow indexing of projects in Elasticsearch. The following snippet shows how a project with many processes is currently indexed:

{
	"_index": "kitodo_project",
	"_type": "_doc",
	"_id": "1",
	"_score": 1,
	"_source": {
		"active": true,
		"title": "Example Project",
		"processes": [
			{
				"id": 1,
				"title": "Process Title"
			},
			{
				"id": 2,
				"title": "Another Process Title"
			},
			# ...
			# repeating potentially over 10,000 times depending on the number of processes
			# ...
		],
		"startDate": "2016-01-01 00:00:00",
		"endDate": "2019-12-31 00:00:00",
		"numberOfVolumes": 0,
		"templates": [...],
		"users": [...],
		"folder": [...],
		"numberOfPages": 0,
		"metsRightsOwner": "Digital Library Kitodo",
		"client.name": "Client_ChangeMe",
		"client.id": 1
	}
}

For large projects with more than 10,000 processes, the search itself is still fast (Elasticsearch reports a query time of 2 ms); however, the JSON document becomes huge and takes a long time to parse. During newspaper generation, the corresponding project is saved again after each new newspaper process is generated, so potentially hundreds of times, and each time the project is re-indexed with a single new entry added to the list of processes.
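To illustrate why this repeated re-indexing is expensive, here is a rough cost model. It is only a sketch: the class and method names are hypothetical, and the figures (10,000 existing processes, 300 newly generated ones) are illustrative values, not measurements. It counts how many process entries are serialized in total when the full list is written after every newly generated process.

```java
/** Back-of-the-envelope cost model; all names here are hypothetical. */
final class ReindexCostSketch {

    /**
     * Counts the process entries serialized in total when the project
     * document is re-indexed after each newly generated process and every
     * re-index writes the complete list of processes.
     */
    static long entriesSerialized(long existingProcesses, long generatedProcesses) {
        long total = 0;
        for (long k = 1; k <= generatedProcesses; k++) {
            // After the k-th new process, the list holds existing + k entries.
            total += existingProcesses + k;
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 10,000 existing and 300 new processes.
        System.out.println(entriesSerialized(10_000, 300)); // prints 3045150
    }
}
```

Under these assumptions, adding 300 processes serializes roughly three million list entries, although only 300 of them are new.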

The proposed solution removes the list of processes from the index mapping and replaces it with a property hasProcesses, which is still needed so that the "delete project" icon can be disabled when showing a list of projects.
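The idea can be sketched as follows. This is a minimal illustration, not the actual Kitodo indexing code: the class name, method name, and parameters are assumptions made for the example, and only the field names title, active, and hasProcesses come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; not the actual Kitodo indexing code. */
final class ProjectDocumentSketch {

    /** Builds the index document for a project (hypothetical helper). */
    static Map<String, Object> toIndexDocument(String title, boolean active, List<String> processTitles) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("title", title);
        document.put("active", active);
        // A single boolean replaces one {id, title} entry per process,
        // so the document size no longer grows with the number of processes.
        document.put("hasProcesses", !processTitles.isEmpty());
        return document;
    }

    public static void main(String[] args) {
        System.out.println(toIndexDocument("Example Project", true,
                List.of("Process Title", "Another Process Title")));
    }
}
```

In this sketch the document stays the same size regardless of how many processes the project has, and the user interface can still decide from hasProcesses whether the delete icon should be enabled.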

A disadvantage of this solution is that projects can no longer be searched for by the names of their processes. To my knowledge, the user interface currently does not support a keyword-based search for projects anyway. However, if such a search scenario should be supported in the future, the next best solution would be to change the newspaper generation task so that it does not save a project multiple times while generating new processes (which has problems of its own, e.g., an inconsistent database during the generation task).

A positive side effect of this solution is that both the indexing time and the query time of projects have improved, which speeds up re-indexing all projects and even loading the dashboard user interface.

@solth
This might also be a problem in the Hibernate-Search branch, see line:

@IndexedEmbedded(includePaths = {"id", "title"})

effective-webwork@1a5de7d#diff-5f7246e37b075cd985e0d6b3bdeeb919b5368811377ab0a3517df08897f3c576R111

thomaslow and others added 3 commits April 21, 2022 19:44
…h project in elastic search.

This list was only indexed in order to determine whether a project has any processes attached, so that the delete project icon could be enabled or disabled. This improves the indexing performance dramatically for large projects, because rendering and parsing long lists of thousands of processes as JSON requires a lot of resources.

Requires re-indexing to take effect, since the Elasticsearch mapping was changed.
@Kathrin-Huber Kathrin-Huber requested a review from solth April 29, 2022 08:28
@Kathrin-Huber Kathrin-Huber merged commit a72786c into kitodo:master May 4, 2022
@thomaslow thomaslow deleted the performance-when-generating-newspapers branch May 6, 2022 10:15