Modify file carton to deal with duplicates #455
Merged
This is a major refactor of the `FileCarton` class returned by the `get_file_carton()` function, mainly to deal with duplicate rows for an input identifier. This can happen when a single identifier (e.g., a Gaia DR2 `source_id`) is associated with more than one `catalogid`. When this happens, the new code ranks the duplicates by distance (assigning `distance=0` if the distance is null, as for phase 1) and selects the one with the lowest distance. Internally this is done via a subquery with a window function that partitions on target ID.

A corner case is when all the duplicates have been added as part of phase 1 and all the entries have null distance. In this case the code selects the entry with the lowest `catalogid`.
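As a reference for the ranking logic, here is a minimal plain-Python sketch that mirrors what the window-function subquery does (the row layout and the `target_id`, `catalogid`, and `distance` names are illustrative, not the carton's actual schema):

```python
from collections import defaultdict


def deduplicate(rows):
    """Keep one row per target_id, mirroring the window-function ranking.

    ``rows`` is an iterable of dicts with ``target_id``, ``catalogid``,
    and ``distance`` keys, where ``distance`` may be None.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["target_id"]].append(row)

    selected = []
    for duplicates in groups.values():
        # Null distances count as zero (phase 1 entries). Ties are broken
        # by the lowest catalogid, which also covers the corner case where
        # every duplicate has a null distance.
        best = min(
            duplicates,
            key=lambda r: (
                r["distance"] if r["distance"] is not None else 0.0,
                r["catalogid"],
            ),
        )
        selected.append(best)

    return selected
```

In SQL terms this is roughly `ROW_NUMBER() OVER (PARTITION BY target_id ORDER BY COALESCE(distance, 0), catalogid)` with the first-ranked row kept from each partition.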
Apart from some code cleaning, the new code also changes a few things:
- Changes the `copy_data()` function. Since the file table can be dumped to a CSV safely, it now does that internally and copies the CSV, which seems to be a bit faster for large tables.
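As an illustration of the dump-then-copy approach, here is a rough sketch using psycopg2's `copy_expert` (the connection handling and the `temp_file_carton` table name are hypothetical, not the actual carton code):

```python
import csv
import io

import psycopg2  # assumed Postgres driver; not necessarily what the carton uses


def copy_rows_via_csv(conn, rows, table="temp_file_carton"):
    """Serialize rows to an in-memory CSV and bulk-load it with COPY.

    A single COPY of a CSV stream is typically much faster than
    row-by-row INSERTs for large tables.
    """
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)

    with conn.cursor() as cursor:
        cursor.copy_expert(
            f"COPY {table} FROM STDIN WITH (FORMAT csv)", buffer
        )
    conn.commit()
```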