Limit case duplicate merging comparison based on creation date and archived status [3] #11465

MartinWahnschaffe · 2023-02-08T15:31:29Z

Problem Description

In #9054 we improved the performance of the case duplicate merging query. This doesn't mean that duplicate detection is fast in every case: the process still executes a lot of comparison logic. In the example, the query had to compare 372 of the 1500 cases with each of the ~85k cases, resulting in about 31 million comparisons. The number of comparisons grows quickly when either side grows.
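As a rough illustration of the arithmetic above, here is a minimal sketch. The function name `comparison_count` is made up for this example; the figures are the ones quoted in the description, with "~85k" taken as 85,000.

```python
def comparison_count(selected_cases: int, comparison_cases: int) -> int:
    """Each selected case is compared with every case in the comparison set."""
    return selected_cases * comparison_cases


# Figures from the problem description ("~85k" approximated as 85,000).
selected = 372
total = 85_000

baseline = comparison_count(selected, total)
print(f"{baseline:,} comparisons")

# Growth on either side scales the work linearly, so both together scale it quadratically.
print(f"{comparison_count(selected * 2, total):,} comparisons with twice as many selected cases")
print(f"{comparison_count(selected * 2, total * 2):,} comparisons with both sides doubled")
```

Shrinking either factor, for example by excluding archived cases or cases created later, reduces the work proportionally.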

Proposed Change

Reduce the number of comparisons as follows:

- Backend: Only compare cases with cases that were created in the same period or before (currently we compare with all other cases). Duplicates that drop out of the result this way will still be found as soon as the later period is checked. In addition, redoing a check for the same week will not give you new results, because any cases created since then fall into later weeks.
- UI: Introduce a filter for the case status (active, archived, all). By default only check active cases. The CaseCriteria already has the option EntityRelevanceStatus for this, so only the UI needs to be extended.
- UI: Set the default filtered period for the creation date to the last week. I suggest using DateHelper.getPreviousEpiWeek in combination with getEpiWeekStart and getEpiWeekEnd for this.
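The pairing rule and the default period proposed above can be sketched as follows. This is an illustration only, not the SORMAS implementation: `previous_week_bounds` and `candidate_pairs` are hypothetical names, the week is assumed to run Monday to Sunday, and creation dates are simplified to calendar dates (the real column holds timestamps).

```python
from datetime import date, timedelta


def previous_week_bounds(today: date) -> tuple[date, date]:
    """Return (Monday, Sunday) of the week before the one containing `today`.

    Assumption: weeks start on Monday.
    """
    this_monday = today - timedelta(days=today.weekday())
    start = this_monday - timedelta(days=7)
    return start, start + timedelta(days=6)


def candidate_pairs(cases: dict[str, date], start: date, end: date) -> list[tuple[str, str]]:
    """Pair each case created in [start, end] with every case created before it.

    Cases created after the period never appear on either side of a pair.
    """
    pairs = []
    for case_id, created in cases.items():
        if not (start <= created <= end):
            continue  # only cases from the selected period are checked
        for other_id, other_created in cases.items():
            if other_created < created:  # compare with older cases only
                pairs.append((case_id, other_id))
    return pairs


start, end = previous_week_bounds(date(2023, 2, 8))
cases = {
    "A": date(2023, 1, 20),
    "B": date(2023, 1, 31),
    "C": date(2023, 2, 2),
    "D": date(2023, 2, 7),  # created after the period, so never compared here
}
print(start, end)
print(candidate_pairs(cases, start, end))
```

Because case "D" is ignored when checking the earlier week, rerunning the check for that week yields the same pairs no matter how many cases are created afterwards. Any pair involving "D" shows up when its own week is checked.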

Acceptance Criteria

Implementation Details

The join from the second case to the person needs to use a subquery for the person that orders the persons by id. This leads to much better query planning, because Postgres realizes that it is beneficial to materialize the joined data.

The overall resulting query should look like this (when called with the admin), with the modified sections being highlighted. Note that this is the variant when only non-archived cases are queried.

select case0_.id as col_0_0_, case1_.id as col_1_0_, case0_.creationDate as col_2_0_
from cases case0_ cross join cases case1_
left outer join Person person2_ on case0_.person_id=person2_.id
left outer join Symptoms symptoms3_ on case0_.symptoms_id=symptoms3_.id
left outer join (SELECT * FROM Person ORDER by id) person4_ on case1_.person_id=person4_.id
left outer join Symptoms symptoms5_ on case1_.symptoms_id=symptoms5_.id

where case0_.deleted=FALSE and case1_.deleted=FALSE
and case0_.archived=FALSE and case1_.archived=FALSE
and case0_.creationDate>'2021-12-30' and case0_.creationDate<'2021-12-31'
and case0_.deleted=FALSE and similarity(((person2_.firstName||' ')||person2_.lastName),((person4_.firstName||' ')||person4_.lastName))>0.65
and case0_.disease=case1_.disease
and abs(date_part('EPOCH',case0_.reportDate)-date_part('EPOCH',case1_.reportDate))<=2592000
and (person2_.sex is null or person4_.sex is null or person2_.sex='UNKNOWN' or person4_.sex='UNKNOWN' or person2_.sex=person4_.sex)
and (person2_.birthdate_dd is null or person2_.birthdate_mm is null or person2_.birthdate_yyyy is null or person4_.birthdate_dd is null or person4_.birthdate_mm is null or person4_.birthdate_yyyy is null or person2_.birthdate_dd=person4_.birthdate_dd and person2_.birthdate_mm=person4_.birthdate_mm and person2_.birthdate_yyyy=person4_.birthdate_yyyy)
and (symptoms3_.onsetDate is null or symptoms5_.onsetDate is null or abs(date_part('EPOCH',symptoms3_.onsetDate)-date_part('EPOCH',symptoms5_.onsetDate))<=2592000)
and case1_.creationDate<case0_.creationDate
and case1_.creationDate<'2021-12-31'

order by case0_.creationDate desc

limit 100;
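The constant 2592000 in the query above is a number of seconds and corresponds to 30 days. The following sketch mirrors the report date and onset date conditions of the query in plain Python. The function names are invented for this illustration and are not part of the code base.

```python
from datetime import datetime, timedelta
from typing import Optional

THRESHOLD_SECONDS = 2_592_000  # value taken from the query

# 30 days * 24 hours * 3600 seconds
assert timedelta(seconds=THRESHOLD_SECONDS) == timedelta(days=30)


def report_dates_match(a: datetime, b: datetime) -> bool:
    """Mirror of abs(date_part('EPOCH', a) - date_part('EPOCH', b)) <= 2592000."""
    return abs((a - b).total_seconds()) <= THRESHOLD_SECONDS


def onset_dates_match(a: Optional[datetime], b: Optional[datetime]) -> bool:
    """A missing onset date on either side never rules a pair out."""
    if a is None or b is None:
        return True
    return report_dates_match(a, b)


first = datetime(2021, 12, 30, 12, 0)
print(report_dates_match(first, first - timedelta(days=30)))
print(report_dates_match(first, first - timedelta(days=30, seconds=1)))
print(onset_dates_match(None, first))
```

The boundary is inclusive: two dates exactly 30 days apart still count as a match.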

Additional Information

It's not clear yet how the user manual will be updated.

MartinWahnschaffe · 2023-02-08T16:56:58Z

I have tested the effect of my proposed changes by simply adjusting the SQL accordingly and analyzing the result.

Before the changes it took 15 seconds to execute the query: https://explain.dalibo.com/plan/2c1f66374421cc85

With the changes it is way slower and takes 58 seconds: https://explain.dalibo.com/plan/e5659984276g047d

The reason seems to be that the original query is executing using materialize:

While the new query just uses index scans that are 150x underestimated:

Not sure how to influence the query planner to get better results.

MartinWahnschaffe · 2023-02-10T13:11:02Z

Findings with @stefanspiska

Increasing the limit to 200 led to Postgres deciding that using materialize is a better option, taking 34 seconds (so still worse):
https://explain.dalibo.com/plan/2bgh1eh55037785c

Forcing postgres to order the persons joined for the comparison cases by id leads to a merge join in combination with materialize and is a lot faster - 8 seconds:
https://explain.dalibo.com/plan/4167h3gc076e92fe

The same can be achieved with a JOIN LATERAL.

I have updated the implementation details accordingly.

sergiupacurariu · 2023-03-14T10:30:50Z

Please check the query plans for the initial situation and for the variant that uses a subquery in the second person join. The queries were run on a local performance db with 631k cases. Except for the join subquery, everything else is identical.

initial situation
https://explain.dalibo.com/plan/f4b710g43a6adhde
join subquery added
https://explain.dalibo.com/plan/0eaa33gf73c64b7a

Given the above test results, I suggest not implementing the join subquery.

abrudanancuta · 2023-03-22T10:57:55Z

The ticket was re-opened because the new filter is misaligned and overlaps the number of duplicates found.

abrudanancuta · 2023-03-28T13:33:20Z

Validated on test-de with the version: 1.82.0-SNAPSHOT (2864773)

sergiupacurariu · 2023-04-13T15:08:29Z

One of the changes that improved the performance of the case duplicate merging query was the removal of the disease index from the "cases" table. This change can also have an impact on other queries throughout Sormas. Below you can see an evaluation of the query performance in different situations and for different relevant users.

### Test impact of disease index - performance db (seconds)

| Scenario | With disease index: National user | With disease index: Limited disease user | Without disease index: National user | Without disease index: Limited disease user |
| --- | --- | --- | --- | --- |
| dashboard access time | 8 | 30 | 11 | 29 |
| case grid | 4 | 4 | 6 | 4 |
| contacts grid | 3 | 4 | 4 | 4 |
| events grid | 3 | 3 | 5 | 3 |
| samples grid | 5 | 7 | 6 | 6 |
| dashboard with data | 4 | 180 | 13 | 90 |
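To make the effect of dropping the index easier to read, the differences can be computed from the measurements above. This is only a small sketch over the numbers in the table; the structure and names are chosen for this example.

```python
# Seconds per scenario: (with index national, with index limited,
#                        without index national, without index limited)
timings = {
    "dashboard access time": (8, 30, 11, 29),
    "case grid": (4, 4, 6, 4),
    "contacts grid": (3, 4, 4, 4),
    "events grid": (3, 3, 5, 3),
    "samples grid": (5, 7, 6, 6),
    "dashboard with data": (4, 180, 13, 90),
}


def deltas(row: tuple[int, int, int, int]) -> tuple[int, int]:
    """Return (national, limited) change in seconds after removing the index.

    Positive values mean the scenario became slower without the index.
    """
    with_national, with_limited, without_national, without_limited = row
    return without_national - with_national, without_limited - with_limited


for scenario, row in timings.items():
    national, limited = deltas(row)
    print(f"{scenario}: national user {national:+d}s, limited disease user {limited:+d}s")
```

In these measurements the national user gets slower in every scenario without the index, while the limited disease user is unchanged or faster, most visibly for the dashboard with data.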

MartinWahnschaffe added the backend, vaadin-app, change, and performance labels Feb 8, 2023

MartinWahnschaffe added this to the Iteration 2023-01 - 1.80.0 milestone Feb 8, 2023

MartinWahnschaffe self-assigned this Feb 8, 2023

StefanKock added the cases label Feb 9, 2023

StefanKock removed this from the Iteration 2023-01 - 1.80.0 milestone Feb 9, 2023

MartinWahnschaffe mentioned this issue Feb 9, 2023

Limit lists for duplicate merging of cases #9054

Closed

10 tasks

MartinWahnschaffe removed their assignment Feb 13, 2023

AndyBakcsy-she changed the title ~~Limit case duplicate merging comparison based on creation date and archived status~~ Limit case duplicate merging comparison based on creation date and archived status [3] Feb 20, 2023

This was referenced Feb 21, 2023

[PERFORMANCE] Merging of duplicates takes too long and sometimes leads to a Server Connection Lost #8615

Closed

Analyze and improve performance of queries based on user access logic #7734

Open

sergiupacurariu self-assigned this Feb 23, 2023

sergiupacurariu added a commit that referenced this issue Mar 16, 2023

#11465 - Limit case duplicate merging comparison based on creation da…

4a9f9fb

…te and archived status

sergiupacurariu mentioned this issue Mar 16, 2023

#11465 - Limit case duplicate merging comparison based on creation da… #11652

Merged

sergiupacurariu added a commit that referenced this issue Mar 16, 2023

#11465 - Limit case duplicate merging comparison based on creation da…

4834ff3

…te and archived status - fix tests

sergiupacurariu added a commit that referenced this issue Mar 16, 2023

#11465 - Limit case duplicate merging comparison based on creation da…

525dc30

…te and archived status

sergiupacurariu added a commit that referenced this issue Mar 17, 2023

#11465 - Limit case duplicate merging comparison based on creation da…

b5ad836

…te and archived status - changes after review

sergiupacurariu added a commit that referenced this issue Mar 20, 2023

#11465 - Limit case duplicate merging comparison based on creation da…

48cf124

…te and archived status - changes after review

sergiupacurariu closed this as completed in #11652 Mar 20, 2023

sergiupacurariu added a commit that referenced this issue Mar 20, 2023

Merge pull request #11652 from hzi-braunschweig/feature-11465_limit_c…

25f6ea1

…ase_duplicate_merging_on_creation_and_archive #11465 - Limit case duplicate merging comparison based on creation da…

abrudanancuta self-assigned this Mar 21, 2023

abrudanancuta reopened this Mar 22, 2023

sergiupacurariu added a commit that referenced this issue Mar 27, 2023

#11465 - Limit case duplicate merging comparison based on creation da…

cd24bf7

…te and archived status - removed disease index

sergiupacurariu mentioned this issue Mar 27, 2023

#11465 - Limit case duplicate merging comparison based on creation da… #11734

Merged

leventegal-she closed this as completed in #11734 Mar 27, 2023

leventegal-she added a commit that referenced this issue Mar 27, 2023

Merge pull request #11734 from hzi-braunschweig/feature-11465_limit_c…

1d0e672

…ase_duplicate_merging_on_creation_and_archive #11465 - Limit case duplicate merging comparison based on creation da…

sergiupacurariu added a commit that referenced this issue Mar 27, 2023

#11465 - Limit case duplicate merging comparison based on creation da…

8fb89b3

…te and archived status - alignement fix

sergiupacurariu mentioned this issue Mar 27, 2023

#11465 - Limit case duplicate merging comparison based on creation da… #11746

Merged

abrudanancuta added this to the Iteration 2023-03 - 1.82.0 milestone Mar 27, 2023

leventegal-she added a commit that referenced this issue Mar 28, 2023

Merge pull request #11746 from hzi-braunschweig/feature-11465_limit_c…

2864773

…ase_duplicate_merging_fix_alignement #11465 - Limit case duplicate merging comparison based on creation da…

abrudanancuta added the qa-verified Issue has been tested and verified by QA label Mar 28, 2023
