To prepare and normalize data for downstream processing, it would be helpful to have CLI functions that accept a messy input FASTA (possibly including external sequences) and return a FASTA suitable for multiple alignment.
Such functionality could apply a series of transform functions, one after another, to sequences read in via Bio.SeqIO, which avoids writing temp files after each filter stage. The optional parameters of each transform should be exposed to the user. Some transforms will be NOPs if they do not apply to the input, and should just pass the input through. Some transforms may reject sequences by returning None. In such cases, the sequence processor should continue to the next sequence (i.e. a transform returning None means the sequence is omitted from the output, but the overall filtering proceeds normally).
Some transforms would only be appropriate to post-process an alignment.
Transform functions should probably be members of a common class to store global information (number of seqs rejected, metrics, seq names seen, etc.). It may be necessary to iterate over a FASTA file once up front to gather stats, so that transforms based on the stats of a dataset can be applied.
The idea is that these transforms can be applied in a default order to perform basic cleaning of most datasets, or composed in a custom way specific to particular datasets.
In all cases, the transform functions should:
accept one Bio.Seq.Seq()
return one Bio.Seq.Seq() (or None to reject the sequence)
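The processing loop described above could look roughly like the following. This is only a sketch under assumptions: `Record` is a hypothetical stand-in for a Biopython record (so the example runs without Biopython installed), and `SequenceProcessor`, `uppercase`, and `drop_empty` are illustrative names, not part of any existing code.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Record:
    """Hypothetical minimal stand-in for a Biopython record."""
    description: str
    seq: str


# A transform takes one record and returns one record, or None to reject it.
Transform = Callable[[Record], Optional[Record]]


@dataclass
class SequenceProcessor:
    """Applies transforms in order and keeps dataset-wide counters."""
    transforms: List[Transform] = field(default_factory=list)
    n_seen: int = 0
    n_rejected: int = 0

    def process(self, records: Iterable[Record]) -> Iterator[Record]:
        for rec in records:
            self.n_seen += 1
            out: Optional[Record] = rec
            for transform in self.transforms:
                out = transform(out)
                if out is None:
                    # Rejected: count it and move on to the next sequence.
                    self.n_rejected += 1
                    break
            if out is not None:
                yield out


def uppercase(rec: Record) -> Record:
    """Illustrative transform: normalize sequence case."""
    return Record(rec.description, rec.seq.upper())


def drop_empty(rec: Record) -> Optional[Record]:
    """Illustrative transform: reject records with no sequence."""
    return rec if rec.seq else None


processor = SequenceProcessor(transforms=[drop_empty, uppercase])
records = [Record("a", "acgt"), Record("b", ""), Record("c", "ttga")]
cleaned = list(processor.process(records))
```

Because `process` is a generator, records stream through all stages without intermediate files, and the counters on the processor are available once iteration finishes.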
Notes on some of the transforms listed below:
remove_spaces_in_sequences(in_seq, replace_with_ambig=False): I think Biopython ignores whitespace, but maybe we should have an option to replace it with "N" for the handful of external sequences that use spaces to represent gaps/missing data.
reject_if_description_seen(in_seq): requires storage of description hashes in the containing class.
reject_if_sequence_seen(in_seq): requires storage of sequence hashes in the containing class.
Potentially useful transforms (in approx. order of use):
trim_description(in_seq, prefix_strip="", suffix_strip="")
replace_spaces_in_description(in_seq, replacement_char="")
description_change_via_regex(in_seq, regex=r"")
include_on_description_match(in_seq, regex=r"")
exclude_on_description_match(in_seq, regex=r"")
start_date(in_seq, date=None): date parsed via datetime.datetime.strptime(s, '%Y-%m-%d')
end_date(in_seq, date=None): date parsed via datetime.datetime.strptime(s, '%Y-%m-%d')
reject_sequences_by_length(in_seq, min_len=None, max_len=None): NOP if min_len and max_len are not specified
remove_spaces_in_sequences(in_seq, replace_with_ambig=False)
reject_if_description_seen(in_seq)
reject_if_sequence_seen(in_seq)
include_coordinates(in_seq, [(start_bp, end_bp), ...])
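As a rough illustration of how a few of these could sit on a shared class, here is a minimal sketch. Assumptions: `Record` is again a hypothetical stand-in for a Biopython record, the class name `Transforms` is made up, and hashing the exact sequence string with SHA-256 is just one possible choice for "sequence seen" (it is case-sensitive as written).

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Record:
    """Hypothetical minimal stand-in for a Biopython record."""
    description: str
    seq: str


class Transforms:
    """Holds state shared by transforms (here: hashes of sequences seen)."""

    def __init__(self) -> None:
        self.seq_hashes: Set[str] = set()

    def reject_sequences_by_length(self, in_seq, min_len=None, max_len=None):
        # NOP when neither bound is given.
        if min_len is None and max_len is None:
            return in_seq
        n = len(in_seq.seq)
        if min_len is not None and n < min_len:
            return None
        if max_len is not None and n > max_len:
            return None
        return in_seq

    def include_on_description_match(self, in_seq, regex=r""):
        # An empty pattern matches everything, so the default passes through.
        return in_seq if re.search(regex, in_seq.description) else None

    def reject_if_sequence_seen(self, in_seq) -> Optional[Record]:
        # Assumption: exact (case-sensitive) sequence identity via SHA-256.
        digest = hashlib.sha256(in_seq.seq.encode("utf-8")).hexdigest()
        if digest in self.seq_hashes:
            return None
        self.seq_hashes.add(digest)
        return in_seq


t = Transforms()
first = Record("virus_A|2019-05-01", "ACGTACGT")
duplicate = Record("virus_B|2020-01-15", "ACGTACGT")
short = Record("virus_C", "ACG")

kept_first = t.reject_if_sequence_seen(first)
kept_duplicate = t.reject_if_sequence_seen(duplicate)
short_unbounded = t.reject_sequences_by_length(short)
short_bounded = t.reject_sequences_by_length(short, min_len=5)
matched = t.include_on_description_match(first, regex=r"^virus_A")
unmatched = t.include_on_description_match(duplicate, regex=r"^virus_A")
```

Storing hashes rather than full sequences keeps memory use bounded by the number of records rather than their length, which is presumably the motivation for the "hashes in containing class" note.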
For post-processing an alignment:
mask_sites(in_seq, sites=None, ranges=None, bed_file=None)
ungap_sequence(in_seq)
reject_with_gap_fraction(threshold=0.0)
reject_with_ambig_fraction(threshold=0.0)
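A sketch of what the post-alignment transforms might do, with several assumptions called out: `Record` is a hypothetical stand-in for a Biopython record, gaps are `-` or `.`, masked positions become `N`, `sites` are 0-based, `ranges` are half-open `(start, end)` pairs, and `reject_with_gap_fraction` takes `in_seq` and rejects when the gap fraction is strictly greater than `threshold`. The issue does not pin down any of these, and `bed_file` handling is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Record:
    """Hypothetical minimal stand-in for a Biopython record."""
    description: str
    seq: str


GAP_CHARS = "-."  # assumption: characters treated as alignment gaps


def ungap_sequence(in_seq: Record) -> Record:
    """Strip gap characters, returning an unaligned sequence."""
    return Record(in_seq.description,
                  "".join(c for c in in_seq.seq if c not in GAP_CHARS))


def reject_with_gap_fraction(in_seq: Record,
                             threshold: float = 0.0) -> Optional[Record]:
    """Reject when the fraction of gap characters exceeds threshold.

    Assumption: under this reading the default of 0.0 rejects any
    sequence that contains a gap.
    """
    if not in_seq.seq:
        return in_seq  # nothing to measure; pass through
    n_gaps = sum(1 for c in in_seq.seq if c in GAP_CHARS)
    return None if n_gaps / len(in_seq.seq) > threshold else in_seq


def mask_sites(in_seq: Record,
               sites: Optional[Iterable[int]] = None,
               ranges: Optional[Iterable[Tuple[int, int]]] = None,
               mask_char: str = "N") -> Record:
    """Replace the given 0-based sites and half-open ranges with mask_char."""
    positions = set(sites or [])
    for start, end in ranges or []:
        positions.update(range(start, end))
    if not positions:
        return in_seq  # NOP when nothing is requested
    masked = "".join(mask_char if i in positions else c
                     for i, c in enumerate(in_seq.seq))
    return Record(in_seq.description, masked)


aligned = Record("s1", "AC--GTAC")
ungapped = ungap_sequence(aligned)
masked = mask_sites(aligned, sites=[0], ranges=[(4, 6)])
kept = reject_with_gap_fraction(aligned, threshold=0.5)
dropped = reject_with_gap_fraction(aligned, threshold=0.1)
```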
Perhaps a parameter should be exposed to allow additional transforms to be specified by the user in an arbitrary Python file loaded dynamically via importlib.
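Loading user-supplied transforms could be done along these lines. This is a sketch, not a settled design: the helper name `load_user_transforms` and the convention that the user file exposes a module-level list called `TRANSFORMS` are assumptions made for the example. The `importlib.util` calls are the standard way to import a module from a file path.

```python
import importlib.util
import os
import tempfile


def load_user_transforms(path, module_name="user_transforms"):
    """Import a Python file by path and return its TRANSFORMS list.

    Assumption: the user file defines a module-level list named
    TRANSFORMS containing transform callables.
    """
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load transforms from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the user's file
    return list(getattr(module, "TRANSFORMS", []))


# Demo: write a small user file to a temp directory and load it.
user_source = '''
def shout(in_seq):
    return in_seq.upper()

TRANSFORMS = [shout]
'''

with tempfile.TemporaryDirectory() as tmp:
    user_path = os.path.join(tmp, "my_transforms.py")
    with open(user_path, "w") as handle:
        handle.write(user_source)
    user_transforms = load_user_transforms(user_path)

result = user_transforms[0]("acgt")
```

Note that `exec_module` executes arbitrary code from the file, so this option should only be pointed at files the user trusts.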