-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EVA-3435 - Retrieve INSDC accession based on FASTA file and compare against metadata #29
Merged
Merged
Changes from 5 commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
b8b468d
WIP - check fasta against metadata reference assembly
apriltuesday ba5171d
support multiple vcf files and analyses per fasta file
apriltuesday b80b078
update report for fasta checks
apriltuesday 079226a
fix typo, support paginated responses
apriltuesday 8046387
bugfixes
apriltuesday cdedd83
address review comments
apriltuesday File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +1,73 @@ | ||
|
||
{% macro fasta_check_report(validation_results) -%} | ||
{% for fasta_file, results_for_fasta in validation_results.get('fasta_check', {}).items() %} | ||
{% if results_for_fasta.get('all_insdc') %} | ||
{% set icon = "✔" %} | ||
{% set row_class = "report-section pass" %} | ||
{% set text = "all sequences are INSDC accessioned" %} | ||
{% macro fasta_check_report(validation_results, file_name) -%} | ||
{% set results_for_fasta = validation_results['fasta_check'][file_name] %} | ||
|
||
<!-- All INSDC check results --> | ||
{% if results_for_fasta.get('all_insdc') %} | ||
{% set icon = "✔" %} | ||
{% set row_class = "report-section pass" %} | ||
{% set text = "All sequences are INSDC accessioned" %} | ||
{% else %} | ||
{% set icon = "❌" %} | ||
{% set row_class = "report-section fail collapsible" %} | ||
{% set text = "Some sequences are not INSDC accessioned" %} | ||
{% endif %} | ||
<div class='{{ row_class }}'>{{ icon }} {{ text }} </div> | ||
{% if not results_for_fasta.get('all_insdc') %} | ||
<div class="error-list"> | ||
<div class="error-description">First 10 sequences not in INSDC. <strong>Full report:</strong> {{ results_for_fasta.get('report_path', '') }}</div> | ||
<table> | ||
<tr> | ||
<th>Sequence name</th><th>Refget md5</th> | ||
</tr> | ||
{% set sequence_info_list = results_for_fasta.get('sequences', [])|rejectattr("insdc")|list %} | ||
{% for sequence_info in sequence_info_list[:10] %} | ||
<tr> | ||
<td><strong>{{sequence_info.get('sequence_name') }}</strong></td><td> {{ sequence_info.get('sequence_md5') }}</td> | ||
</tr> | ||
{% endfor %} | ||
</table> | ||
</div> | ||
{% endif %} | ||
|
||
<!-- INSDC concordance check results (optional) --> | ||
{% if 'possible_assemblies' in results_for_fasta %} | ||
{% if 'metadata_assembly_compatible' in results_for_fasta %} | ||
<!-- found assembly in metadata, so definitely know the analysis --> | ||
{% set analysis_text = results_for_fasta.get('associated_analyses')|join(", ") %} | ||
{% if results_for_fasta.get('metadata_assembly_compatible') %} | ||
{% set icon = "✔" %} | ||
{% set row_class = "report-section pass" %} | ||
{% set text = analysis_text + ": Assembly accession in metadata is compatible" %} | ||
{% else %} | ||
{% set icon = "❌" %} | ||
{% set row_class = "report-section fail collapsible" %} | ||
{% set text = analysis_text + ": Assembly accession in metadata is not compatible" %} | ||
{% endif %} | ||
{% else %} | ||
<!-- has possible assemblies but no metadata assembly --> | ||
{% set icon = "❌" %} | ||
{% set row_class = "report-section fail collapsible" %} | ||
{% set text = "some sequences are not INSDC accessioned" %} | ||
{% set text = "No assembly accession found in metadata" %} | ||
{% endif %} | ||
<div class='{{ row_class }}'>{{ icon }} {{ fasta_file }}: {{ text }} </div> | ||
{% if not results_for_fasta.get('all_insdc') %} | ||
<div class='{{ row_class }}'>{{ icon }} {{ text }} </div> | ||
{% if 'metadata_assembly_compatible' not in results_for_fasta or not results_for_fasta['metadata_assembly_compatible'] %} | ||
<div class="error-list"> | ||
<div class="error-description">First 10 sequences not in INSDC. <strong>Full report:</strong> {{ results_for_fasta.get('report_path', '') }}</div> | ||
<div class="error-description"><strong>Full report:</strong> {{ results_for_fasta.get('report_path', '') }}</div> | ||
<table> | ||
<tr> | ||
<th>Sequence name</th><th>Refget md5</th> | ||
<th>Category</th><th>Accessions</th> | ||
</tr> | ||
{% set sequence_info_list = results_for_fasta.get('sequences', [])|rejectattr("insdc")|list %} | ||
{% for sequence_info in sequence_info_list[:10] %} | ||
<tr> | ||
<td><strong>{{sequence_info.get('sequence_name') }}</strong></td><td> {{ sequence_info.get('sequence_md5') }}</td> | ||
</tr> | ||
{% endfor %} | ||
|
||
<tr> | ||
<td><strong>Assembly accession found in metadata</strong></td> | ||
<td>{{ results_for_fasta.get('assembly_in_metadata', 'Not found') }}</td> | ||
</tr> | ||
<tr> | ||
<td><strong>Assembly accession(s) compatible with FASTA</strong></td> | ||
<td>{{ results_for_fasta.get('possible_assemblies')|join(", ") }}</td> | ||
</tr> | ||
</table> | ||
</div> | ||
{% endif %} | ||
{% endfor %} | ||
{% endif %} | ||
{%- endmacro %} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
import os | ||
from collections import defaultdict | ||
|
||
from ebi_eva_common_pyutils.logger import logging_config | ||
|
||
logger = logging_config.get_logger(__name__) | ||
|
||
|
||
def get_samples_per_analysis(metadata): | ||
"""Returns mapping of analysis alias to sample names, based on metadata.""" | ||
samples_per_analysis = defaultdict(list) | ||
for sample_info in metadata.get('sample', []): | ||
samples_per_analysis[sample_info.get('analysisAlias')].append(sample_info.get('sampleInVCF')) | ||
return { | ||
analysis_alias: set(samples) | ||
for analysis_alias, samples in samples_per_analysis.items() | ||
} | ||
|
||
|
||
def get_files_per_analysis(metadata): | ||
"""Returns mapping of analysis alias to filenames, based on metadata.""" | ||
files_per_analysis = defaultdict(list) | ||
for file_info in metadata.get('files', []): | ||
if file_info.get('fileType') == 'vcf': | ||
files_per_analysis[file_info.get('analysisAlias')].append(file_info.get('fileName')) | ||
return { | ||
analysis_alias: set(filepaths) | ||
for analysis_alias, filepaths in files_per_analysis.items() | ||
} | ||
|
||
|
||
def get_reference_assembly_for_analysis(metadata, analysis_alias): | ||
"""Returns the reference assembly for this analysis (does not validate format).""" | ||
for analysis in metadata.get('analysis', []): | ||
if analysis.get('analysisAlias') == analysis_alias: | ||
return analysis.get('referenceGenome') | ||
return None | ||
|
||
|
||
def get_analysis_for_vcf_file(vcf_file, files_per_analysis): | ||
"""Returns list of analysis aliases associated with the vcf file path.""" | ||
if not os.path.exists(vcf_file): | ||
raise FileNotFoundError(f'{vcf_file} cannot be resolved') | ||
analysis_aliases = [analysis_alias for analysis_alias in files_per_analysis | ||
if os.path.basename(vcf_file) in files_per_analysis[analysis_alias]] | ||
return analysis_aliases |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could debate this but in my view we would want
check_fasta_insdc
to report as much as possible and filter what needs to be reported in the final report.metadata_assembly_compatible
per analysisif bool(metadata_insdc) and bool(possible_assemblies)
I think I'm mostly worried about the case where the assembly fasta contains one non-insdc sequence and all the check fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense, right now it's kind of assuming that the is_insdc check and contig alias checks are always 100% accurate and consistent with each other, which is unlikely... I'll update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new implementation would allow a report like this:
For a case where e.g. a sequence is not supported by the refget endpoint, but according to contig alias it's consistent with the metadata. It will also show two errors if not everything is INSDC and the metadata is not consistent.
The only case where the second check doesn't show up at all in the final report is if we don't find any possible assemblies - I think I was thinking that we're expecting contig alias might be missing things and so we wouldn't want to output an error, but perhaps we should show it as an error anyway so both checks always show up? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that their will be a number of cases where the contig alias will not have the assembly. We should not show the error in that case.
We probably will come back to this decision if we fell like the contig alias has enough coverage of assemblies we're using. But for the moment this is sensible.