-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EVA-3435 - Retrieve INSDC accession based on FASTA file and compare against metadata #29
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looks good. I left a comment mainly about the fact that there might be different references in different analysis
bin/check_fasta_insdc.py
Outdated
# Only report on metadata concordance if all of the following hold: | ||
# 1) All sequences in FASTA file are INSDC | ||
# 2) At least one compatible assembly accession was found in contig alias | ||
# 3) Found a single assembly accession in the metadata to compare against this FASTA file | ||
# If (3) is missing but (1) and (2) hold, we will still report possible INSDC assemblies. | ||
if all_insdc and possible_assemblies: | ||
results['possible_assemblies'] = possible_assemblies | ||
if all_insdc and possible_assemblies and metadata_insdc: | ||
results['metadata_assembly_compatible'] = (metadata_insdc in possible_assemblies) | ||
results['associated_analyses'] = analyses | ||
results['assembly_in_metadata'] = metadata_insdc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could debate this but in my view we would want check_fasta_insdc
to report as much as possible and filter what needs to be reported in the final report.
- I would report all possible assemblies if any was found.
metadata_assembly_compatible
per analysisif bool(metadata_insdc) and bool(possible_assemblies)
I think I'm mostly worried about the case where the assembly fasta contains one non-insdc sequence and all the check fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense, right now it's kind of assuming that the is_insdc check and contig alias checks are always 100% accurate and consistent with each other, which is unlikely... I'll update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new implementation would allow a report like this:
For a case where e.g. a sequence is not supported by the refget endpoint, but according to contig alias it's consistent with the metadata. It will also show two errors if not everything is INSDC and the metadata is not consistent.
The only case where the second check doesn't show up at all in the final report is if we don't find any possible assemblies - I think I was thinking that we're expecting contig alias might be missing things and so we wouldn't want to output an error, but perhaps we should show it as an error anyway so both checks always show up? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that their will be a number of cases where the contig alias will not have the assembly. We should not show the error in that case.
We probably will come back to this decision if we fell like the contig alias has enough coverage of assemblies we're using. But for the moment this is sensible.
Example report: