Modify file carton to deal with duplicates #455
Merged
This is a major refactor of the `FileCarton` class returned by the `get_file_carton()` function, mainly to deal with duplicate rows for an input identifier. This can happen when a single identifier (e.g., a Gaia DR2 `source_id`) is associated with more than one `catalogid`. When this happens, the new code ranks the duplicates by distance (assigning `distance=0` if the distance is null, as for phase 1) and selects the one with the lowest distance. Internally this is done via a subquery with a window function that partitions on target ID.

A corner case is when all the duplicates have been added as part of phase 1 and all the entries have null distance. In this case the code selects the entry with the lowest `catalogid`.
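As a reference for the ranking logic, here is a minimal plain-Python sketch that mirrors what the window-function subquery does (the row layout and the `target_id`, `catalogid`, and `distance` names are illustrative, not the carton's actual schema):

```python
from collections import defaultdict


def deduplicate(rows):
    """Keep one row per target_id, mirroring the window-function ranking.

    ``rows`` is an iterable of dicts with ``target_id``, ``catalogid``,
    and ``distance`` keys, where ``distance`` may be None.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["target_id"]].append(row)

    selected = []
    for duplicates in groups.values():
        # Null distances count as zero (phase 1 entries). Ties are broken
        # by the lowest catalogid, which also covers the corner case where
        # every duplicate has a null distance.
        best = min(
            duplicates,
            key=lambda r: (
                r["distance"] if r["distance"] is not None else 0.0,
                r["catalogid"],
            ),
        )
        selected.append(best)

    return selected
```

In SQL terms this is roughly `ROW_NUMBER() OVER (PARTITION BY target_id ORDER BY COALESCE(distance, 0), catalogid)` with the first-ranked row kept from each partition.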
Apart from some code cleaning, the new code also changes a few things:
- Changes the `copy_data()` function. Since the file table can be dumped to a CSV safely, it now does that internally and copies the CSV, which seems to be a bit faster for large tables.
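As an illustration of the dump-then-copy approach, here is a rough sketch using psycopg2's `copy_expert` (the connection handling and the `temp_file_carton` table name are hypothetical, not the actual carton code):

```python
import csv
import io

import psycopg2  # assumed Postgres driver; not necessarily what the carton uses


def copy_rows_via_csv(conn, rows, table="temp_file_carton"):
    """Serialize rows to an in-memory CSV and bulk-load it with COPY.

    A single COPY of a CSV stream is typically much faster than
    row-by-row INSERTs for large tables.
    """
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)

    with conn.cursor() as cursor:
        cursor.copy_expert(
            f"COPY {table} FROM STDIN WITH (FORMAT csv)", buffer
        )
    conn.commit()
```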