add sequence cleanup and filtration functions #20

tomkinsc · 2020-06-11T17:56:36Z

In order to prepare and normalize data for downstream processing it would be helpful to have CLI functions that accept a messy input fasta (possibly including external sequences) and return a fasta suitable for multiple alignment.

Such functionality could apply a series of transform functions, one after another, to sequences read in via Bio.SeqIO, which avoids writing temp files after each filter stage. The optional parameters of each transform should be exposed to the user. Some transforms will be NOPs if they do not apply to the input and should simply pass it through. Others may reject a sequence by returning None. In such cases, the sequence processor should continue to the next sequence (i.e. a transform returning None means that sequence is omitted from the output, but the overall filtering proceeds normally).
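The chaining described above could be sketched roughly as follows. This is a minimal, dependency-free illustration, not a proposed implementation: the `Record` dataclass is a hypothetical stand-in for a Biopython record, and `apply_transforms`, `strip_prefix`, `reject_short`, and the `lab/` prefix are made-up names for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Sequence


@dataclass
class Record:
    """Hypothetical stand-in for a Biopython record (description + sequence)."""
    description: str
    seq: str


Transform = Callable[[Record], Optional[Record]]


def apply_transforms(records: Iterable[Record],
                     transforms: Sequence[Transform]) -> Iterator[Record]:
    """Run every transform on each record; drop records any transform rejects."""
    for record in records:
        for transform in transforms:
            record = transform(record)
            if record is None:  # rejected: move on to the next record
                break
        else:
            # No transform returned None, so the record is kept.
            yield record


def strip_prefix(record: Record, prefix: str = "lab/") -> Record:
    """Example transform: a NOP when the prefix is absent."""
    if record.description.startswith(prefix):
        return Record(record.description[len(prefix):], record.seq)
    return record


def reject_short(record: Record, min_len: int = 5) -> Optional[Record]:
    """Example transform: reject by returning None."""
    return record if len(record.seq) >= min_len else None


records = [Record("lab/sampleA", "ACGTACGT"), Record("sampleB", "ACG")]
cleaned = list(apply_transforms(records, [strip_prefix, reject_short]))
```

Because the records flow through a generator, nothing is written to disk between stages. Per-transform options could be bound with `functools.partial` before the list is passed in.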

Some transforms would only be appropriate to post-process an alignment.

Transform functions should probably be members of a common class to store global information (number of seqs rejected, metrics, seq names seen, etc.). It may be necessary to iterate over a fasta file once beforehand to gather the dataset stats that some transforms depend on.
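One possible shape for such a class, with a first pass to gather stats and a second pass to filter, is sketched below. Everything here is an assumption made for illustration: the class name `SequenceCleaner`, the `(description, sequence)` tuples used in place of Biopython objects, the case-insensitive duplicate check, and the median-based length filter.

```python
import hashlib
from collections import Counter
from statistics import median


class SequenceCleaner:
    """Hypothetical container holding state shared by the transforms."""

    def __init__(self):
        self.rejected = Counter()      # transform name -> number rejected
        self.seen_seq_hashes = set()   # hashes of sequences already emitted
        self.median_length = None      # filled in by the first pass

    def collect_stats(self, records):
        """First pass: gather dataset-wide stats needed by later transforms."""
        lengths = [len(seq) for _description, seq in records]
        self.median_length = median(lengths) if lengths else None

    def reject_if_sequence_seen(self, record):
        _description, seq = record
        # Assumption: duplicates are detected case-insensitively.
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        if digest in self.seen_seq_hashes:
            self.rejected["reject_if_sequence_seen"] += 1
            return None
        self.seen_seq_hashes.add(digest)
        return record

    def reject_below_median_fraction(self, record, fraction=0.5):
        """Hypothetical stats-based transform; a NOP if no stats were collected."""
        if self.median_length is None:
            return record
        _description, seq = record
        if len(seq) < fraction * self.median_length:
            self.rejected["reject_below_median_fraction"] += 1
            return None
        return record

    def clean(self, records):
        """Second pass: apply the transforms in order, skipping rejected records."""
        for record in records:
            for transform in (self.reject_if_sequence_seen,
                              self.reject_below_median_fraction):
                record = transform(record)
                if record is None:
                    break
            else:
                yield record


records = [("a", "ACGTACGT"), ("b", "acgtacgt"), ("c", "ACG"), ("d", "ACGTACGA")]
cleaner = SequenceCleaner()
cleaner.collect_stats(records)
kept = list(cleaner.clean(records))
```

After the run, `cleaner.rejected` holds per-transform rejection counts that could be reported to the user.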

The idea is that these transforms can be applied in a default order to perform basic cleaning of most datasets, or composed in a custom order specific to particular datasets.

In all cases, the transform functions should:

accept one Bio.Seq.Seq()
return one Bio.Seq.Seq(), or None if the sequence is rejected

Potentially useful transforms (in approx. order of use):

trim_description(in_seq,prefix_strip="",suffix_strip="")
replace_spaces_in_description(in_seq, replacement_char="")
description_change_via_regex(in_seq,regex=r"")
include_on_description_match(in_seq, regex=r"")
exclude_on_description_match(in_seq, regex=r"")
start_date(in_seq,date=None)
- Parse from description in format datetime.datetime.strptime(s, '%Y-%m-%d')
- include if > this date
end_date(in_seq,date=None)
- Parse from description in format datetime.datetime.strptime(s, '%Y-%m-%d')
- include if < this date
reject_sequences_by_length(in_seq,min_len=None,max_len=None)
- NOP if min_len and max_len not specified
remove_spaces_in_sequences(in_seq, replace_with_ambig=False)
- I think Biopython ignores whitespace, but maybe we should have an option to replace with "N" for the handful of external sequences that use spaces to represent gaps/missing data
reject_if_description_seen(in_seq)
- requires storage of description hashes in containing class
reject_if_sequence_seen(in_seq)
- requires storage of seq hashes in containing class
include_coordinates(in_seq,[(start_bp,end_bp),...])
- one-indexed coords
- ranges are concatenated
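A few of the transforms listed above might look like the sketch below. It is only an illustration under stated assumptions: `Record` is a hypothetical stand-in for a Biopython record, the date is taken from the first `YYYY-MM-DD` token found anywhere in the description, records without a parseable date are rejected by the date filters, and coordinate ranges are treated as end-inclusive.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for a Biopython record."""
    description: str
    seq: str


DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date(description):
    """Return the first YYYY-MM-DD date in the description, or None."""
    match = DATE_RE.search(description)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(0), "%Y-%m-%d")
    except ValueError:  # e.g. month 13
        return None


def start_date(in_seq, date=None):
    if date is None:  # NOP when no date is given
        return in_seq
    found = parse_date(in_seq.description)
    # Assumption: records with no parseable date are rejected.
    return in_seq if found is not None and found > date else None


def end_date(in_seq, date=None):
    if date is None:
        return in_seq
    found = parse_date(in_seq.description)
    return in_seq if found is not None and found < date else None


def reject_sequences_by_length(in_seq, min_len=None, max_len=None):
    length = len(in_seq.seq)
    if min_len is not None and length < min_len:
        return None
    if max_len is not None and length > max_len:
        return None
    return in_seq  # also the NOP path when neither bound is given


def include_coordinates(in_seq, ranges=()):
    """Keep only the given one-indexed ranges (assumed end-inclusive), concatenated."""
    if not ranges:
        return in_seq
    pieces = [in_seq.seq[start_bp - 1:end_bp] for start_bp, end_bp in ranges]
    return in_seq._replace(seq="".join(pieces))


rec = Record("sample1|2020-03-15", "ACGTACGTAC")
subset = include_coordinates(rec, [(1, 3), (8, 10)])
```

Here `subset.seq` is bases 1 to 3 followed by bases 8 to 10 of the input. Whether a missing date should reject the record or pass it through is a design decision the issue leaves open.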

For post-processing an alignment:

mask_sites(in_seq, sites=None, ranges=None, bed_file=None)
ungap_sequence(in_seq)
reject_with_gap_fraction(in_seq, threshold=0.0)
reject_with_ambig_fraction(in_seq, threshold=0.0)
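The post-alignment transforms could be sketched on plain strings as below. The details are assumptions rather than settled behavior: gaps are `-` or `.`, any character other than A, C, G, T, U or a gap counts as ambiguous, fractions are taken over the full aligned length, a sequence is rejected only when its fraction strictly exceeds the threshold, and sites and ranges are one-indexed with inclusive ends. The `bed_file` option is left out of this sketch.

```python
GAP_CHARS = "-."
UNAMBIGUOUS = set("ACGTU")


def ungap_sequence(seq):
    """Remove gap characters from an aligned sequence."""
    return "".join(ch for ch in seq if ch not in GAP_CHARS)


def gap_fraction(seq):
    return sum(ch in GAP_CHARS for ch in seq) / len(seq) if seq else 0.0


def ambig_fraction(seq):
    # Assumption: anything that is neither a gap nor A/C/G/T/U is ambiguous.
    ambiguous = sum(
        ch not in GAP_CHARS and ch.upper() not in UNAMBIGUOUS for ch in seq
    )
    return ambiguous / len(seq) if seq else 0.0


def reject_with_gap_fraction(seq, threshold=0.0):
    """Return None when the gap fraction strictly exceeds the threshold."""
    return None if gap_fraction(seq) > threshold else seq


def reject_with_ambig_fraction(seq, threshold=0.0):
    """Return None when the ambiguous fraction strictly exceeds the threshold."""
    return None if ambig_fraction(seq) > threshold else seq


def mask_sites(seq, sites=None, ranges=None, mask_char="N"):
    """Replace one-indexed sites and inclusive ranges with mask_char."""
    positions = set(sites or [])
    for start_bp, end_bp in ranges or []:
        positions.update(range(start_bp, end_bp + 1))
    return "".join(
        mask_char if index in positions else ch
        for index, ch in enumerate(seq, start=1)
    )


aligned = "AC--GTNNAC"
masked = mask_sites(aligned, sites=[1], ranges=[(5, 6)])
ungapped = ungap_sequence(aligned)
```

Masking keeps the alignment length unchanged, so it can run before or after the fraction-based rejections. Ungapping changes coordinates and would normally come last.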

Perhaps a parameter should be exposed to allow additional transforms to be specified by the user in an arbitrary python file loaded dynamically via importlib.
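Loading user-supplied transforms with importlib might work along these lines. This is a sketch under an assumed convention: the user's file exposes a module-level `TRANSFORMS` list, and `load_user_transforms`, the module name `user_transforms`, and the demo plugin are all hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_user_transforms(path):
    """Import a Python file by path and return the callables in its TRANSFORMS list."""
    spec = importlib.util.spec_from_file_location("user_transforms", str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load transforms from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the user's code
    # Assumption: the file lists its transforms in a TRANSFORMS attribute.
    return list(getattr(module, "TRANSFORMS", []))


# Demo: write a small plugin file to a temp directory, then load it.
plugin_source = '''
def uppercase_sequence(record):
    description, seq = record
    return (description, seq.upper())

TRANSFORMS = [uppercase_sequence]
'''

with tempfile.TemporaryDirectory() as tmp:
    plugin_path = Path(tmp) / "my_transforms.py"
    plugin_path.write_text(plugin_source)
    transforms = load_user_transforms(plugin_path)
    result = transforms[0](("sample1", "acgt"))
```

Since `exec_module` executes whatever the file contains, such an option should only be pointed at files the user trusts.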
