Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qapa quant / create_merged_data.R - Error splitting name field when gene IDs have _PAR_Y suffix #49

Closed
SamBryce-Smith opened this issue Mar 6, 2023 · 1 comment

Comments

@SamBryce-Smith
Copy link

SamBryce-Smith commented Mar 6, 2023

Hi,

I was following the workflow to create the 'standard' reference library, but instead starting from an updated gencode reference GTF (human v40) to define initial 3'UTR regions. I followed the suggested workflow (very nicely detailed btw!) as described and the build & salmon index/quant steps worked flawlessly.

However, when running qapa quant, I came across the following error:

Separating Ensembl IDs
Error in `[.data.table`(df, , `:=`(c("Transcript", "Gene", "Species",  :
  Supplied 9 columns to be assigned 11 items. Please see NEWS for v1.12.2.
Calls: separate_ensembl_field -> [ -> [.data.table
Execution halted

Digging into my outputs a little further, I realised this comes from rare gene IDs in the 'Name' field of quant.sf files that contain a '_PAR_Y' suffix. e.g.

ENST00000355432_ENSG00000198223.18_PAR_Y,ENST00000381529_ENSG00000198223.18_PAR_Y,ENST00000494969_ENSG00000198223.18_PAR_Y_hsa_chrY_1309401_1309921_+_utr_1309868_1309921::chrY:1309401-1309921(+)

The extra underscores in PAR_Y lead to a few extra columns being produced from the strsplit('_') call which causes the error.

Adding a step to e.g. replace underscores with full stops fixes the bug.

# Replace _PAR_Y suffix of gene IDs if applicable - since split by '_', need to replace first to avoid a parsing error (but still retain distinct ID)
    # Format Ensembl Transcript and Ensembl Gene IDs if there are multiple
    # Remove utr tag
    df[, Transcript := str_replace_all(Transcript , "_PAR_Y", ".PARY") %>%
           str_extract(tx_pattern) %>%
           format_multi_ensembl_ids() %>%
           #str_replace("_(hg\\d+|mm\\d+|unk)", "") %>%
           str_replace("_utr", "")]

Output of create_merged_data.R looks good for these events e.g. SLC25A6:

Transcript Gene Gene_Name Chr LastExon.Start LastExon.End Strand UTR3.Start UTR3.End Length ctl_1
ENST00000381401 ENSG00000169100 SLC25A6 chrX 1386139 1386759 - 1386139 1386601 462 159.140895
ENST00000381401 ENSG00000169100 SLC25A6 chrX 1386251 1386759 - 1386251 1386601 350 0
ENST00000381401 ENSG00000169100 SLC25A6 chrX 1386293 1386759 - 1386293 1386601 308 450.639916
ENST00000381401 ENSG00000169100 SLC25A6 chrY 1386151 1386759 - 1386151 1386601 450 41.407935

Although the gene_id suffix is lost in the output table, providing the table as input to compute_pau.R still produces sensible PAU estimates for these genes (i.e. PAU is computed separately for the non-suffixed and suffixed genes respectively). The NumEvents should maybe be adjusted for the different chromosomes but I didn't get around to trying to fix that.

APA_ID Gene_Name Num_Events Transcript Gene Chr LastExon.Start LastExon.End Strand UTR3.Start UTR3.End Length ctl_1.PAU
ENSG00000169100_1_D SLC25A6 4 ENST00000381401 ENSG00000169100 chrX 1386139 1386759 - 1386139 1386601 462 26.098
ENSG00000169100_2_D SLC25A6 4 ENST00000381401 ENSG00000169100 chrX 1386251 1386759 - 1386251 1386601 350 0
ENSG00000169100_3_P SLC25A6 4 ENST00000381401 ENSG00000169100 chrX 1386293 1386759 - 1386293 1386601 308 73.902
ENSG00000169100_4_S SLC25A6 4 ENST00000381401 ENSG00000169100 chrY 1386151 1386759 - 1386151 1386601 450 100

Hope this is clear. Happy to share more tables, information or submit a PR if you wish :) Just a few more general comments:

  • In the README ('GENCODE gene prediction annotation table' section) you pull from Gencode 'basic' annotation only, whereas I've used the 'comprehensive gene annotation' GTF (reference chromosomes only). Do you advise against using the comprehensive annotation for any reason(s)?
    • SLC25A6 does have the 'basic' tag even for _PAR_Y (e.g. ENST00000381401.11_PAR_Y), so this wouldn't necessarily avoid the problem in this case

Thanks,
Sam

@kcha
Copy link
Collaborator

kcha commented Mar 8, 2023

Hi Sam,

Thanks for bringing this up. It looks like you identified something similar to a related issue:
#13 (comment). The difference being I only provided an interim solution. A PR would be welcome!

For choice of Gencode set, I don't really have a great answer other than it was easier to deal with a simpler annotation set at the time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants