Modify file carton to deal with duplicates #455

Merged (4 commits) on Jul 4, 2024

@albireox albireox commented Jul 4, 2024

This is a major refactor of the FileCarton class returned by the get_file_carton() function, mainly to deal with duplicate rows for an input identifier. This can happen when a single identifier (e.g., a Gaia DR2 source_id) is associated with more than one catalogid.

When this happens, the new code ranks the duplicates by distance (treating a null distance as distance=0, as is the case for phase 1 entries) and selects the one with the lowest distance. Internally this is done via a subquery with a window function that partitions on target ID.

A corner case is when all the duplicates have been added as part of phase 1 and all the entries have null distance. In this case the code will select the entry with the lowest catalogid.
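The ranking described above can be sketched as follows. This is a minimal illustration and not the code in this PR: it uses an in-memory SQLite database instead of PostgreSQL, and the table name `matches`, the column names, and the sample rows are all hypothetical. It also assumes the null-distance corner case can be handled by using catalogid as a tie-breaker in the same `ORDER BY`.

```python
import sqlite3

# Hypothetical rows: (target_id, catalogid, distance).
# None stands for a null distance (e.g., a phase 1 entry).
ROWS = [
    (1, 101, 0.8),
    (1, 102, 0.2),   # closest match for target 1
    (2, 201, None),
    (2, 200, None),  # all distances null: lowest catalogid wins
    (3, 301, 0.5),   # no duplicates for target 3
]

# Rank the rows within each target_id, treating a null distance as 0
# and breaking ties on catalogid, then keep only the top-ranked row.
QUERY = """
SELECT target_id, catalogid
FROM (
    SELECT target_id,
           catalogid,
           ROW_NUMBER() OVER (
               PARTITION BY target_id
               ORDER BY COALESCE(distance, 0.0), catalogid
           ) AS rn
    FROM matches
) AS ranked
WHERE rn = 1
ORDER BY target_id
"""


def select_best_matches(rows):
    """Return one (target_id, catalogid) pair per target_id."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE matches "
            "(target_id INTEGER, catalogid INTEGER, distance REAL)"
        )
        conn.executemany("INSERT INTO matches VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


best = select_best_matches(ROWS)
print(best)
```

Note that under this ordering a null distance (treated as 0) outranks any positive distance for the same target, which is consistent with the description above.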

Apart from some code cleaning, the new code also changes a few things:

  • The file data is now loaded into a temporary table without using the copy_data() function. Since the file table can be safely dumped to CSV, the code writes it to a CSV internally and copies that CSV into the temporary table, which seems to be a bit faster for large tables.
  • The union of all the subqueries for the different parent catalogues is run as a CTE, and the results are then made distinct on catalogid. This prevents duplicates when the same catalogid is returned by different subqueries.
  • Sequential scanning is disabled for the file carton queries.
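
The CTE-plus-distinct step in the second item can be sketched roughly as below. This is an assumption-laden illustration, not the PR's implementation: the tables `from_gaia` and `from_tic` are hypothetical stand-ins for the per-parent-catalogue subqueries, and because SQLite (used here so the example is self-contained) has no `DISTINCT ON`, the deduplication is emulated with `ROW_NUMBER()`.

```python
import sqlite3

# Hypothetical stand-ins for the results of two parent-catalogue subqueries.
SCHEMA = """
CREATE TABLE from_gaia (catalogid INTEGER, target_id INTEGER, distance REAL);
CREATE TABLE from_tic  (catalogid INTEGER, target_id INTEGER, distance REAL);
"""

# The union runs as a CTE; a second CTE numbers the rows per catalogid so
# that only one row per catalogid survives (emulating DISTINCT ON).
QUERY = """
WITH all_matches AS (
    SELECT catalogid, target_id, distance FROM from_gaia
    UNION ALL
    SELECT catalogid, target_id, distance FROM from_tic
),
numbered AS (
    SELECT catalogid,
           target_id,
           ROW_NUMBER() OVER (
               PARTITION BY catalogid ORDER BY target_id
           ) AS rn
    FROM all_matches
)
SELECT catalogid, target_id
FROM numbered
WHERE rn = 1
ORDER BY catalogid
"""


def distinct_catalogids(gaia_rows, tic_rows):
    """Union both inputs and keep a single row per catalogid."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.executemany("INSERT INTO from_gaia VALUES (?, ?, ?)", gaia_rows)
        conn.executemany("INSERT INTO from_tic VALUES (?, ?, ?)", tic_rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# catalogid 10 is returned by both subqueries but appears once in the result.
GAIA_ROWS = [(10, 1, 0.1), (11, 2, 0.3)]
TIC_ROWS = [(10, 1, 0.1), (12, 3, None)]
print(distinct_catalogids(GAIA_ROWS, TIC_ROWS))
```

For the third item, PostgreSQL exposes the `enable_seqscan` planner setting, which can be turned off per session or transaction; whether the PR uses exactly that mechanism is not stated here.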

@albireox albireox merged commit 90b0d54 into main Jul 4, 2024
10 checks passed
@albireox albireox deleted the albireox/file-carton-duplicates branch July 4, 2024 18:13
albireox added a commit that referenced this pull request Jul 4, 2024