filter_tx_by_three_end - updating to atlas 3'end has unexpected(?) behaviour when polyA sites are intervals not single nucleotides #50

Open
SamBryce-Smith opened this issue Jul 4, 2023 · 0 comments

gr_exact = gr.subset(lambda df: df[distance_col] == 0)

If the distance to the nearest atlas site is 0, i.e. the predicted 3'end directly overlaps an atlas site, then the 3' coordinate is kept as is.

With the standard PolyASite BED file (and generally with polyA sites from 3'-seq), polyA sites are typically represented as clusters (i.e. regions) rather than single coordinates. This means that different predicted 3' coordinates can overlap the same atlas site, and none of them will be updated.

This also means that, when a 3'end is updated, the updating strategy uses the 3'most coordinate of the nearest cluster, not necessarily the 'representative coordinate' for that cluster (the position with the highest read support within the cluster). This may not be the optimal behaviour.
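To make the two effects concrete, here is a toy model of the rule as described in this issue. It is a sketch, not the script's actual code: the function name `update_three_end`, the 0-based half-open coordinates, and the example cluster are all assumptions made for illustration.

```python
def update_three_end(pred_end, site_start, site_end, strand):
    """Toy model of the updating rule described in this issue (hypothetical helper).

    Coordinates are 0-based, half-open (BED style). pred_end is the 0-based
    position of the predicted 3'end nucleotide.
    """
    # Distance 0: the predicted 3'end falls inside the atlas interval, so it is kept.
    if site_start <= pred_end < site_end:
        return pred_end
    # Otherwise snap to the 3'most position of the nearest atlas interval,
    # which is not necessarily the cluster's representative coordinate.
    return site_end - 1 if strand == "+" else site_start


cluster = (1000, 1015, "+")  # hypothetical 15 nt atlas cluster
predicted = [995, 1003, 1011, 1020]  # hypothetical predicted 3'ends
updated = [update_three_end(p, *cluster) for p in predicted]
print(updated)  # [1014, 1003, 1011, 1014]
```

Under these assumptions, four predictions near one atlas site collapse to three distinct 3'ends rather than one, because the two predictions inside the cluster are left untouched.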

This is probably acceptable within a single experiment, but it complicates combining predicted last exons across experiments, e.g. when generating BEDs of representative PAS for last exons, because multiple closely spaced 3'ends will be predicted for the same atlas site. It would also lead to effective duplication (i.e. exons differing by a few nucleotides), which is probably not good for Salmon's index size (although it shouldn't affect differential usage in practice, as the isoform expressions will be summed together).

First thoughts:

  • The behaviour of the updating strategy should at the very least be documented.
  • Provide steps/example code to convert representative PolyASite coordinates to single-nucleotide BEDs if desired (preferable to hard-coding extraction of the representative coordinate within the script, which would be less general).
  • Note that using clusters rather than single nucleotides may treat sites unevenly: searching +/- 100nt around a 15nt cluster covers 215nt in total, versus 205nt around a 5nt cluster, so matches are effectively sought in larger windows for longer clusters.
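As a starting point for the second bullet, a minimal sketch of the conversion could look like the following. It assumes that column 4 of the atlas BED has the form `chrom:representative_coordinate:strand` and that the representative coordinate is 1-based; both should be checked against the atlas release in use. The function names and the example record are hypothetical.

```python
def to_single_nt(bed_line):
    """Collapse one cluster BED record to its representative coordinate.

    Assumes column 4 is 'chrom:rep_coord:strand' and that rep_coord is
    1-based; verify both against the atlas file before relying on this.
    """
    fields = bed_line.rstrip("\n").split("\t")
    # rsplit from the right so a ':' inside the chromosome name is tolerated.
    _, rep_coord, _ = fields[3].rsplit(":", 2)
    rep_coord = int(rep_coord)
    # A 1-based position p corresponds to the BED interval [p - 1, p).
    fields[1], fields[2] = str(rep_coord - 1), str(rep_coord)
    return "\t".join(fields)


def convert_file(in_path, out_path):
    """Write a single-nucleotide BED from a cluster BED (paths are placeholders)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Skip blank lines and comment lines, if any are present.
            if line.strip() and not line.startswith("#"):
                dst.write(to_single_nt(line) + "\n")


# Hypothetical cluster record: 20 nt cluster, representative coordinate 16442.
example = "1\t16430\t16450\t1:16442:-\t0.5\t-"
print(to_single_nt(example))
```

Keeping this as a preprocessing step outside the script leaves the choice of atlas and of representative position to the user, in line with the preference stated in the bullet above.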