Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Process Table Columns Sortable #5360

Merged
merged 26 commits into from
Oct 25, 2022

Conversation

thomaslow
Copy link
Collaborator

@thomaslow thomaslow commented Sep 23, 2022

Related Issues:

This pull request makes all columns of the process table sortable. In order to achieve this, the data for each column needs to be available as a property in the elastic search index. Some columns were previously generated on-the-fly, such that sorting them was not possible.

These column values are now calculated at indexing time. As a consequence, indexing speed may be slightly reduced, and UI performance (of the process table) should have slightly improved.

This pull request requires a full re-indexing (including an update of the elastic search mapping).

Demo of Process Table:

simplescreenrecorder-2022-09-23_14.26.28.mp4

Columns that were previously calculated on-the-fly are now added to the elastic search index and calculated at indexing time instead, e.g., the user name of the user that last worked on a task of the process.

The code that calculated these information has been moved to the Kitodo-DataManagment project in order to be available at indexing time.
* @return the list of tasks of the process (and potentially its children)
*/
private static List<Task> getListOfTasksForProgressCalculation(Process process) {
private static List<Task> getListOfTasksForProgressCalculation(Process process, Boolean considerChildren) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using nullable Boolean class instead of primitive boolean as the values should only be true or false ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Henning, I didn't really think about it. Fixed it below.

@thomaslow
Copy link
Collaborator Author

thomaslow commented Sep 26, 2022

@solth In order for the process status to be sortable, it needs to be kept up-to-date in the ElasticSearch index whenever a task status (OPEN, DONE, INWORK, LOCKED) is changed. This requires that the accompanying process is re-indexed whenever a task status is changed. This conflicts with a pull request from last year, see #4543, which presumably improved performance when saving tasks.

How do we deal with this?

  • Re-index processes whenever tasks are saved, potentially reducing performance
  • Don't index process progress (and thus, make it not sortable)

Edit: the same applies to "lastEditingUser", which of course also depends on changes to tasks of a process

@thomaslow thomaslow marked this pull request as draft September 26, 2022 11:12
@thomaslow
Copy link
Collaborator Author

The performance problem mentioned in the previous comment is fixed now by calculating the process status via a custom SQL query. This SQL query allows to efficiently count how many tasks have a certain state (open, locked, inwork, done) even for parent processes that may have hundreds of child tasks (e.g. newspaper processes). Previously, this counting would take more than 1 second for parent processes with many children.

The SQL query uses an SQL statement (with recursive) that is not supported by Hibernate but works with most databases. I checked support for MySQL 8+, MariaDB 10.2.2+, H2Database 1.4+, but other databases should work too, e.g. PostgreSQL, Oracle. The syntax is the same for all databases I know of.

In case the SQL query does not work, there is a chance that the resulting exception is intercepted and the previous bean-based status calculation is used as a fallback. However, this depends on how the query fails.

I also added a selenium test to check that sorting processes by state works.

@thomaslow thomaslow marked this pull request as ready for review October 6, 2022 14:10
@thomaslow
Copy link
Collaborator Author

Merged with master so github only shows code changes to current master.

@solth
Copy link
Member

solth commented Oct 19, 2022

The performance problem mentioned in the previous comment is fixed now by calculating the process status via a custom SQL query.

So that means re-indexing processes whenever a task is saved does not impose any performance problems anymore? Can you elaborate a little more how using a recursive SQL query helps in this case? If I am not mistaken, #4543 improved performance by reducing read/write operations on the index, not on the database. How does an optimised SQL query have an effect on that?

Copy link
Member

@solth solth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this pull request. I tested it and it works very well with the small list of processes I have on my local development system. I cannot say whether the changes concerning indexing have any negative impact on the performance on larger systems with many processes, though.

On the code side I found only a handful of minor issues like a few typos and unused imports. I am unsure about how progress for processes without own tasks is calculated. Perhaps you could just comment on my question (see below) and elaborate a little more on that specific part.

Map<TaskStatus, Integer> counts = countTaskStatusOfProcess(process, considerChildren);
Integer total = counts.values().stream().mapToInt(Integer::intValue).sum();

// report processes without any tasks as if they had a single locked task
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand this. Does this mean the task progress of a process without own tasks is always encoded as "100% locked tasks"? Shouldn't it instead be illegal (e.g. throw an exception) to try to determine the "task progress" of a process without workflow/tasks (when the parameter considerChildrenis false)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this was implemented this way. This was implemented the same before in ProcessService.

Does this mean the task progress of a process without own tasks is always encoded as "100% locked tasks"?

Yes. Unless considerChildren=true and the process has children processes with tasks that are not locked.

Shouldn't it instead be illegal (e.g. throw an exception) to try to determine the "task progress" of a process without workflow/tasks (when the parameter considerChildrenis false)?

Maybe.

@@ -203,7 +204,7 @@ private boolean startExport(Process process, LegacyMetsModsDigitalDocumentHelper

private boolean exportCompletedChildren(List<Process> children) throws DataException {
for (Process child:children) {
if (processService.getProgress(child.getTasks(), null).equals(COMPLETED) && !child.isExported()) {
if (ProcessConverter.getCombinedProgressAsString(child, false).equals(COMPLETED) && !child.isExported()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't calling ProcessConverter.getCombinedProgressAsString with second parameter considerChildren = false with an parent process without own tasks always result in "000000000100", e.g. 100% locked (as oposed to the the expected "100000000000", e.g. 100% completed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this list of children contains processes that have no tasks, they are not catched by this if statement, correct. I don't know what this export code is doing, which is why I didn't change the implementation of how the progress status is calculated.

@thomaslow
Copy link
Collaborator Author

So that means re-indexing processes whenever a task is saved does not impose any performance problems anymore?

No, there are still performance problems related to the general save-strategy in Kitodo as outlined in #5368. The pull request #5371 slightly improves on this issue a bit for processes with parents. Since this pull request requires that processes are re-indexed when their task status changes, it doesn't make things easier, but there is no other way at the moment.

Can you elaborate a little more how using a recursive SQL query helps in this case?

My original assesment of the problem in this comment was a bit incomplete. The process of a task was already re-indexed when its task status changed (except in one case, see line 94 of the WorkflowControllerService.java). So, the conflict with #4543 seems to be resolved without reverting it. Still, without always re-indexing a process whenenver a task changes, I can not guarantee that there might be some code somewhere, which modifies a task outside of the logic of WorkflowControllerService.java without saving and re-indexing its process (such that the process-index will not contain the most up-to-date information).

Besides that, the performance problem of this pull request was caused by two issues:

  1. Changing a task status triggers a save-operation for its process, which leads to the performance issue described in Improve performance when saving and indexing processes, tasks, etc. #5368, which caused many re-indexing operations for its parent processes. This still is a problem, but can not be fixed by this pull request.

  2. Calculating the task status of a parent process took a very long time when retrieving all related tasks for a parent process via bean getter methods (each triggering a database query for the tasks of a child process). This issue was multiplied by problem 1, because parent processes were saved many times (re-calculating the process status multiple times). The custom SQL query allows to efficiently calculate the process status for parent processes (which previously was only partially implemented by Kathrin via the sortHelperStatus property). It is still (unnecessarily) calculated multiple times, but much faster now.

I hope this clears things up a bit.

thomaslow and others added 4 commits October 19, 2022 12:26
Co-authored-by: Arved Solth <[email protected]>
Co-authored-by: Arved Solth <[email protected]>
Co-authored-by: Arved Solth <[email protected]>
@thomaslow
Copy link
Collaborator Author

@solth I commited all of your suggested changes. Thank you for your review.

Copy link
Member

@solth solth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing the change requests. I still want to make a few tests, so I haven't merged this pull request, yet.

Since some of your other PRs have been merged in the meantime, a small code conflict arose. Could you resolve this conflict while I condict my tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Process list not sortable by progress
3 participants