Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EVA-3435 - Retrieve INSDC accession based on FASTA file and compare against metadata #29

Merged
merged 6 commits into from
Mar 28, 2024

Conversation

apriltuesday
Copy link
Contributor

@apriltuesday apriltuesday commented Mar 20, 2024

Example report:
image

@apriltuesday apriltuesday marked this pull request as ready for review March 20, 2024 15:28
@apriltuesday apriltuesday self-assigned this Mar 20, 2024
Copy link
Member

@tcezard tcezard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looks good. I left a comment mainly about the fact that there might be different references in different analysis

bin/check_fasta_insdc.py Show resolved Hide resolved
Comment on lines 127 to 137
# Only report on metadata concordance if all of the following hold:
# 1) All sequences in FASTA file are INSDC
# 2) At least one compatible assembly accession was found in contig alias
# 3) Found a single assembly accession in the metadata to compare against this FASTA file
# If (3) is missing but (1) and (2) hold, we will still report possible INSDC assemblies.
if all_insdc and possible_assemblies:
results['possible_assemblies'] = possible_assemblies
if all_insdc and possible_assemblies and metadata_insdc:
results['metadata_assembly_compatible'] = (metadata_insdc in possible_assemblies)
results['associated_analyses'] = analyses
results['assembly_in_metadata'] = metadata_insdc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could debate this but in my view we would want check_fasta_insdc to report as much as possible and filter what needs to be reported in the final report.

  • I would report all possible assemblies if any was found.
  • metadata_assembly_compatible per analysis if bool(metadata_insdc) and bool(possible_assemblies)
    I think I'm mostly worried about the case where the assembly fasta contains one non-insdc sequence and all the check fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense, right now it's kind of assuming that the is_insdc check and contig alias checks are always 100% accurate and consistent with each other, which is unlikely... I'll update this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new implementation would allow a report like this:
image
For a case where e.g. a sequence is not supported by the refget endpoint, but according to contig alias it's consistent with the metadata. It will also show two errors if not everything is INSDC and the metadata is not consistent.

The only case where the second check doesn't show up at all in the final report is if we don't find any possible assemblies - I think I was thinking that we're expecting contig alias might be missing things and so we wouldn't want to output an error, but perhaps we should show it as an error anyway so both checks always show up? What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that their will be a number of cases where the contig alias will not have the assembly. We should not show the error in that case.
We probably will come back to this decision if we fell like the contig alias has enough coverage of assemblies we're using. But for the moment this is sensible.

@apriltuesday apriltuesday merged commit 0e73aa8 into EBIvariation:main Mar 28, 2024
1 check passed
@apriltuesday apriltuesday deleted the EVA-3435 branch March 28, 2024 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants