add sequence cleanup and filtration functions #20

Open
tomkinsc opened this issue Jun 11, 2020 · 0 comments


In order to prepare and normalize data for downstream processing it would be helpful to have CLI functions that accept a messy input fasta (possibly including external sequences) and return a fasta suitable for multiple alignment.

Such functionality could apply a series of transform functions in turn to sequences read in via Bio.SeqIO.parse(), which avoids writing temp files after each filter stage. The optional parameters of each transform should be exposed to the user. Some transforms will be NOPs if they do not apply to the input, and should simply pass it through. Some transforms may reject a sequence by returning None. In such cases, the sequence processor should continue to the next sequence (i.e. a transform returning None means that sequence is omitted from the output, but the overall filtering proceeds normally).
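A minimal sketch of the serial processing described above. Everything here is illustrative: `Record` is a hypothetical stand-in for a Biopython record (used so the sketch runs without Biopython installed), `apply_transforms` is an assumed name for the sequence processor, and the two transforms are simplified versions of ones proposed later in this issue.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable, Iterable, Iterator, Optional, Sequence


@dataclass
class Record:
    """Hypothetical stand-in for a Biopython record: a description plus a sequence."""
    description: str
    seq: str


Transform = Callable[[Record], Optional[Record]]


def apply_transforms(records: Iterable[Record],
                     transforms: Sequence[Transform]) -> Iterator[Record]:
    """Yield each record after all transforms; skip it if any transform returns None."""
    for record in records:
        for transform in transforms:
            record = transform(record)
            if record is None:
                break  # rejected: omit from output and move on to the next record
        else:
            yield record  # reached only when no transform rejected the record


def reject_sequences_by_length(in_seq, min_len=None, max_len=None):
    """Reject records outside the given bounds; a NOP when neither bound is set."""
    n = len(in_seq.seq)
    if min_len is not None and n < min_len:
        return None
    if max_len is not None and n > max_len:
        return None
    return in_seq


def replace_spaces_in_description(in_seq, replacement_char=""):
    """Return a copy of the record with spaces in the description replaced."""
    return Record(in_seq.description.replace(" ", replacement_char), in_seq.seq)


records = [Record("sample A", "ACGTACGT"), Record("sample B", "AC")]
# User-facing options are bound up front, so every stage takes a single record.
pipeline = [
    partial(replace_spaces_in_description, replacement_char="_"),
    partial(reject_sequences_by_length, min_len=4),
]
cleaned = list(apply_transforms(records, pipeline))
```

Because the processor is a generator, records stream through all stages one at a time and no intermediate file is written between filters.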

Some transforms would only be appropriate to post-process an alignment.

Transform functions should probably be members of a common class to store global information (number of seqs rejected, metrics, seq names seen, etc.). It may be necessary to iterate over a fasta file once to apply transforms based on stats of a dataset.
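One way the shared state could look, as a rough sketch only. The class name `SequenceCleaner`, its attributes, and the `collect_stats` first pass are all hypothetical, and `Record` is again a stand-in for a Biopython record. Upper-casing before hashing is an assumption about how duplicates should be compared.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a Biopython record."""
    description: str
    seq: str


class SequenceCleaner:
    """Hypothetical container holding state shared across transforms."""

    def __init__(self):
        self.rejected = Counter()     # transform name -> number of rejections
        self.seen_seq_hashes = set()  # hashes of sequences already emitted
        self.length_stats = {}        # filled in by an optional first pass

    def collect_stats(self, records):
        """First pass: gather dataset-wide statistics for later transforms."""
        lengths = [len(r.seq) for r in records]
        self.length_stats = {"n": len(lengths), "max": max(lengths, default=0)}

    def reject_if_sequence_seen(self, in_seq):
        """Return None for a sequence whose hash was already seen, else pass it through."""
        # Assumption: duplicates are compared case-insensitively.
        digest = hashlib.sha256(in_seq.seq.upper().encode()).hexdigest()
        if digest in self.seen_seq_hashes:
            self.rejected["reject_if_sequence_seen"] += 1
            return None
        self.seen_seq_hashes.add(digest)
        return in_seq


records = [Record("a", "ACGT"), Record("b", "acgt"), Record("c", "ACGG")]
cleaner = SequenceCleaner()
cleaner.collect_stats(records)
kept = [r for r in records if cleaner.reject_if_sequence_seen(r) is not None]
```

Storing hashes rather than full sequences keeps memory use bounded per record, and the rejection counter gives a per-transform summary that could be reported at the end of a run.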

The idea is that these transforms can be applied in a default order to perform basic cleaning of most datasets, or composited in a custom way specific to particular datasets.

In all cases, the transform functions should:

  • accept one Bio.Seq.Seq()
  • return one Bio.Seq.Seq(), or None if the sequence is rejected

Potentially useful transforms (in approx. order of use):

  • trim_description(in_seq,prefix_strip="",suffix_strip="")
  • replace_spaces_in_description(in_seq, replacement_char="")
  • description_change_via_regex(in_seq,regex=r"")
  • include_on_description_match(in_seq, regex=r"")
  • exclude_on_description_match(in_seq, regex=r"")
  • start_date(in_seq,date=None)
    • Parse from description in format datetime.datetime.strptime(s, '%Y-%m-%d')
    • include if > this date
  • end_date(in_seq,date=None)
    • Parse from description in format datetime.datetime.strptime(s, '%Y-%m-%d')
    • include if < this date
  • reject_sequences_by_length(in_seq,min_len=None,max_len=None)
    • NOP if min_len and max_len not specified
  • remove_spaces_in_sequences(in_seq, replace_with_ambig=False)
    • I think biopython ignores whitespace, but maybe we should have an option to replace with "N" for the handful of external sequences that use spaces to represent gaps/missing data
  • reject_if_description_seen(in_seq)
    • requires storage of description hashes in containing class
  • reject_if_sequence_seen(in_seq)
    • requires storage of seq hashes in containing class
  • include_coordinates(in_seq,[(start_bp,end_bp),...])
    • one-indexed coords
    • ranges are concatenated

For post-processing an alignment:

  • mask_sites(in_seq, sites=None, ranges=None, bed_file=None)
  • ungap_sequence(in_seq)
  • reject_with_gap_fraction(in_seq, threshold=0.0)
  • reject_with_ambig_fraction(in_seq, threshold=0.0)
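A hedged sketch of how the alignment post-processing transforms might behave. Several details are assumptions not stated in this issue: `-` as the gap character, `N` as the mask character, one-indexed inclusive coordinates for `sites` and `ranges`, and rejection when the gap fraction is strictly greater than `threshold`. BED file parsing is left out, and `Record` is a hypothetical stand-in for a Biopython record.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a Biopython record."""
    description: str
    seq: str


def ungap_sequence(in_seq, gap_char="-"):
    """Return a copy of the record with all gap characters removed."""
    return Record(in_seq.description, in_seq.seq.replace(gap_char, ""))


def reject_with_gap_fraction(in_seq, threshold=0.0, gap_char="-"):
    """Reject the record if its fraction of gap characters exceeds `threshold`."""
    if not in_seq.seq:
        return in_seq  # nothing to measure on an empty sequence
    fraction = in_seq.seq.count(gap_char) / len(in_seq.seq)
    return None if fraction > threshold else in_seq


def mask_sites(in_seq, sites=None, ranges=None, mask_char="N"):
    """Replace the listed one-indexed alignment columns with `mask_char`."""
    positions = set(sites or [])
    for start_bp, end_bp in ranges or []:
        positions.update(range(start_bp, end_bp + 1))  # end assumed inclusive
    masked = [
        mask_char if i in positions else base
        for i, base in enumerate(in_seq.seq, start=1)
    ]
    return Record(in_seq.description, "".join(masked))


aligned = Record("s1", "AC--GTAC")  # 2 gaps in 8 columns, gap fraction 0.25
ungapped = ungap_sequence(aligned)
kept = reject_with_gap_fraction(aligned, threshold=0.5)
dropped = reject_with_gap_fraction(aligned, threshold=0.1)
masked = mask_sites(aligned, sites=[1], ranges=[(5, 6)])
```

`reject_with_ambig_fraction` could follow the same pattern, counting ambiguity codes instead of gap characters.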

Perhaps a parameter should be exposed to allow additional transforms to be specified by the user in an arbitrary python file loaded dynamically via importlib.
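Loading user-supplied transforms from an arbitrary file could look roughly like this. The function name `load_transforms` and the convention of collecting every public top-level function are assumptions. The example plugin operates on a plain string only to keep the demonstration short.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path


def load_transforms(path):
    """Import a Python file by path and return its public top-level functions by name."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load transforms from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return {
        name: func
        for name, func in inspect.getmembers(module, inspect.isfunction)
        # Skip private helpers and functions the plugin merely imported.
        if not name.startswith("_") and func.__module__ == module.__name__
    }


# Demonstration with a throwaway plugin file.
with tempfile.TemporaryDirectory() as tmp:
    plugin = Path(tmp) / "my_transforms.py"
    plugin.write_text(
        "def uppercase_sequence(in_seq):\n"
        "    return in_seq.upper()\n"
    )
    user_transforms = load_transforms(plugin)
    result = user_transforms["uppercase_sequence"]("acgt")
```

Note that this executes arbitrary code from the given file, so such a parameter should only be pointed at files the user trusts.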
