Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit case duplicate merging comparison based on creation date and archived status [3] #11465

Closed
3 of 4 tasks
Tracked by #7734
MartinWahnschaffe opened this issue Feb 8, 2023 · 6 comments · Fixed by #11652, #11734 or #11746
Closed
3 of 4 tasks
Tracked by #7734
Assignees
Labels
backend Affects the web backend cases change A change of an existing feature (ticket type) performance Issues that are meant to increase the performance of the application qa-verified Issue has been tested and verified by QA vaadin-app Affects the Vaadin application

Comments

@MartinWahnschaffe
Copy link
Contributor

MartinWahnschaffe commented Feb 8, 2023

Problem Description

In #9054 we have improved the performance of the case duplicate merging query. This doesn't mean that the duplicate detection is fast in any case. The whole process still executes a lot of comparison logic. In the example the query had to compare 372 of the 1500 cases with each of the ~85k cases, resulting in 31 mio. comparisons. You can easily see where this grows when any of both sides grows.

Proposed Change

Decrease the comparison amount by doing the following:

  • Backend: Only compare cases with cases that were created in the same period or before (currently we are comparing with all other cases). You will find the duplicates that would fall out of the result as soon as you looked at the future period. In addition it has the benefit that redoing a check for the same week will not give you new results, because new cases have been created in later weeks.
  • UI: Introduce a filter for the case status (active, archived, all). By default only check active cases. The CaseCriteria already has the option EntityRelevanceStatus for this, so only the UI needs to be extended for this
  • UI: Set the default filtered period for the creation date to the last week. I'd suggest to use DateHelper.getPreviousEpiWeek in combination with getEpiWeekStart and getEpiWeekEnd for this.

Acceptance Criteria

Implementation Details

  • The join from the second case to the person needs to use a sub query for the person that orders the persons by id. This will lead to a much better query planning, because postgres realizes that it's beneficial to materialize the joined data.

The overall resulting query should look like this (when called with the admin), with the modified sections being highlighted. Note that this is the variant when only not-archived cases are querried.

select case0_.id as col_0_0_, case1_.id as col_1_0_, case0_.creationDate as col_2_0_
from cases case0_ cross join cases case1_
left outer join Person person2_ on case0_.person_id=person2_.id
left outer join Symptoms symptoms3_ on case0_.symptoms_id=symptoms3_.id
left outer join (SELECT * FROM Person ORDER by id) person4_ on case1_.person_id=person4_.id
left outer join Symptoms symptoms5_ on case1_.symptoms_id=symptoms5_.id

where case0_.deleted=FALSE and case1_.deleted=FALSE
and case0_.archived=FALSE and case1_.archived=FALSE
and case0_.creationDate>'2021-12-30' and case0_.creationDate<'2021-12-31'
and case0_.deleted=FALSE and similarity(((person2_.firstName||' ')||person2_.lastName),((person4_.firstName||' ')||person4_.lastName))>0.65
and case0_.disease=case1_.disease
and abs(date_part('EPOCH',case0_.reportDate)-date_part('EPOCH',case1_.reportDate))<=2592000
and (person2_.sex is null or person4_.sex is null or person2_.sex='UNKNOWN' or person4_.sex='UNKNOWN' or person2_.sex=person4_.sex)
and (person2_.birthdate_dd is null or person2_.birthdate_mm is null or person2_.birthdate_yyyy is null or person4_.birthdate_dd is null or person4_.birthdate_mm is null or person4_.birthdate_yyyy is null or person2_.birthdate_dd=person4_.birthdate_dd and person2_.birthdate_mm=person4_.birthdate_mm and person2_.birthdate_yyyy=person4_.birthdate_yyyy)
and (symptoms3_.onsetDate is null or symptoms5_.onsetDate is null or abs(date_part('EPOCH',symptoms3_.onsetDate)-date_part('EPOCH',symptoms5_.onsetDate))<=2592000)
and case1_.creationDate<case0_.creationDate
and case1_.creationDate<'2021-12-31'

order by case0_.creationDate desc

limit 100;

Additional Information

  • It's not clear yet how the users manual will be updated.
@MartinWahnschaffe MartinWahnschaffe added backend Affects the web backend vaadin-app Affects the Vaadin application change A change of an existing feature (ticket type) performance Issues that are meant to increase the performance of the application labels Feb 8, 2023
@MartinWahnschaffe MartinWahnschaffe self-assigned this Feb 8, 2023
@MartinWahnschaffe
Copy link
Contributor Author

MartinWahnschaffe commented Feb 8, 2023

I have tried the effect of my proposed changes by simply adjusting the SQL accordingly and analyzing the result.

Before the changes it took 15 seconds to execute the query: https://explain.dalibo.com/plan/2c1f66374421cc85

With the changes it is way slower and takes 58 seconds: https://explain.dalibo.com/plan/e5659984276g047d

The reason seems to be that the original query is executing using materialize:
grafik

While the new query just uses index scans that are 150x underestimated:
grafik

Not sure how to influence the query planer to get better results.

@MartinWahnschaffe
Copy link
Contributor Author

MartinWahnschaffe commented Feb 10, 2023

Findings with @stefanspiska

Increasing the limit to 200 lead to postgres deciding that using materialize is a better option taking 34 seconds (so still worse):
https://explain.dalibo.com/plan/2bgh1eh55037785c

Forcing postgres to order the persons joined for the comparison cases by id leads to a merge join in combination with materialize and is a lot faster - 8 seconds:
https://explain.dalibo.com/plan/4167h3gc076e92fe

The same can be achieved with a JOIN LATERAL.

I have updated the implementation details accordingly.

@MartinWahnschaffe MartinWahnschaffe removed their assignment Feb 13, 2023
@AndyBakcsy-she AndyBakcsy-she changed the title Limit case duplicate merging comparison based on creation date and archived status Limit case duplicate merging comparison based on creation date and archived status [3] Feb 20, 2023
@sergiupacurariu sergiupacurariu self-assigned this Feb 23, 2023
@sergiupacurariu
Copy link
Contributor

please check the query plans for the initial situation and the situation in which uses a subquery in the second person join. The queries were run on a local performance db on 631k cases. Except for the join subquery, everything else is identical.

Due to the above test results, I suggest skipping implementing the join subquery.

sergiupacurariu added a commit that referenced this issue Mar 16, 2023
sergiupacurariu added a commit that referenced this issue Mar 16, 2023
sergiupacurariu added a commit that referenced this issue Mar 16, 2023
sergiupacurariu added a commit that referenced this issue Mar 17, 2023
…te and archived status - changes after review
sergiupacurariu added a commit that referenced this issue Mar 20, 2023
…te and archived status - changes after review
sergiupacurariu added a commit that referenced this issue Mar 20, 2023
…ase_duplicate_merging_on_creation_and_archive

#11465 - Limit case duplicate merging comparison based on creation da…
@abrudanancuta abrudanancuta self-assigned this Mar 21, 2023
@abrudanancuta
Copy link
Contributor

abrudanancuta commented Mar 22, 2023

The ticket was re-opened because the new filter is not aligned and is overlapping with the number of duplicates found
Screenshot 2023-03-22 at 12 52 13

@abrudanancuta abrudanancuta reopened this Mar 22, 2023
sergiupacurariu added a commit that referenced this issue Mar 27, 2023
…te and archived status - removed disease index
leventegal-she added a commit that referenced this issue Mar 27, 2023
…ase_duplicate_merging_on_creation_and_archive

#11465 - Limit case duplicate merging comparison based on creation da…
sergiupacurariu added a commit that referenced this issue Mar 27, 2023
leventegal-she added a commit that referenced this issue Mar 28, 2023
…ase_duplicate_merging_fix_alignement

#11465 - Limit case duplicate merging comparison based on creation da…
@abrudanancuta abrudanancuta added the qa-verified Issue has been tested and verified by QA label Mar 28, 2023
@abrudanancuta
Copy link
Contributor

Validated on test-de with the version: 1.82.0-SNAPSHOT (2864773)

@sergiupacurariu
Copy link
Contributor

sergiupacurariu commented Apr 13, 2023

One of the changes that improved the performance of the case duplicate merging query was the removal of the disease index from "cases" table. This change can also have an impact on other queries throughout Sormas. Below you can see an evaluation of the query performance in different situations and different relevant users.

### Test impact of disease index - performance db (seconds)

------------------------- ------------With disease index---------- ---------Without disease index---------
National user Limited disease User National user Limited disease User
dashboard access time 8 30 11 29
case grid 4 4 6 4
contacts grid 3 4 4 4
events grid 3 3 5 3
samples grid 5 7 6 6
dashboard with data 4 180 13 90

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Affects the web backend cases change A change of an existing feature (ticket type) performance Issues that are meant to increase the performance of the application qa-verified Issue has been tested and verified by QA vaadin-app Affects the Vaadin application
Projects
None yet
4 participants