You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I was following the workflow to create the 'standard' reference library, but instead starting from an updated gencode reference GTF (human v40) to define initial 3'UTR regions. I followed the suggested workflow (very nicely detailed btw!) as described and the build & salmon index/quant steps worked flawlessly.
However, when running qapa quant, I came across the following error:
Separating Ensembl IDs
Error in `[.data.table`(df, , `:=`(c("Transcript", "Gene", "Species", :
Supplied 9 columns to be assigned 11 items. Please see NEWS for v1.12.2.
Calls: separate_ensembl_field -> [ -> [.data.table
Execution halted
Digging into my outputs a little further, I realised this comes from rare gene IDs in the 'Name' field of quant.sf files that contain a '_PAR_Y' suffix. e.g.
The extra underscores in PAR_Y lead to a few extra columns being produced from the strsplit('_') call which causes the error.
Adding a step to e.g. replace underscores with full stops fixes the bug.
# Replace _PAR_Y suffix of gene IDs if applicable - since split by '_', need to replace first to avoid a parsing error (but still retain distinct ID)
# Format Ensembl Transcript and Ensembl Gene IDs if there are multiple
# Remove utr tag
df[, Transcript := str_replace_all(Transcript , "_PAR_Y", ".PARY") %>%
str_extract(tx_pattern) %>%
format_multi_ensembl_ids() %>%
#str_replace("_(hg\\d+|mm\\d+|unk)", "") %>%
str_replace("_utr", "")]
Output of create_merged_data.R looks good for these events e.g. SLC25A6:
Transcript
Gene
Gene_Name
Chr
LastExon.Start
LastExon.End
Strand
UTR3.Start
UTR3.End
Length
ctl_1
ENST00000381401
ENSG00000169100
SLC25A6
chrX
1386139
1386759
-
1386139
1386601
462
159.140895
ENST00000381401
ENSG00000169100
SLC25A6
chrX
1386251
1386759
-
1386251
1386601
350
0
ENST00000381401
ENSG00000169100
SLC25A6
chrX
1386293
1386759
-
1386293
1386601
308
450.639916
ENST00000381401
ENSG00000169100
SLC25A6
chrY
1386151
1386759
-
1386151
1386601
450
41.407935
Although the gene_id suffix is lost in the output table, providing the table as input to compute_pau.R still produces sensible PAU estimates for these genes (i.e. PAU is computed separately for the non-suffixed and suffixed genes respectively). The NumEvents should maybe be adjusted for the different chromosomes but I didn't get around to trying to fix that.
APA_ID
Gene_Name
Num_Events
Transcript
Gene
Chr
LastExon.Start
LastExon.End
Strand
UTR3.Start
UTR3.End
Length
ctl_1.PAU
ENSG00000169100_1_D
SLC25A6
4
ENST00000381401
ENSG00000169100
chrX
1386139
1386759
-
1386139
1386601
462
26.098
ENSG00000169100_2_D
SLC25A6
4
ENST00000381401
ENSG00000169100
chrX
1386251
1386759
-
1386251
1386601
350
0
ENSG00000169100_3_P
SLC25A6
4
ENST00000381401
ENSG00000169100
chrX
1386293
1386759
-
1386293
1386601
308
73.902
ENSG00000169100_4_S
SLC25A6
4
ENST00000381401
ENSG00000169100
chrY
1386151
1386759
-
1386151
1386601
450
100
Hope this is clear. Happy to share more tables, information or submit a PR if you wish :) Just a few more general comments:
In the README ('GENCODE gene prediction annotation table' section) you pull from Gencode 'basic' annotation only, whereas I've used the 'comprehensive gene annotation' GTF (reference chromosomes only). Do you advise against using the comprehensive annotation for any reason(s)?
SLC25A6 does have the 'basic' tag even for _PAR_Y (e.g. ENST00000381401.11_PAR_Y), so this wouldn't necessarily avoid the problem in this case
Thanks,
Sam
The text was updated successfully, but these errors were encountered:
Thanks for bringing this up. It looks like you identified something similar to a related issue: #13 (comment). The difference being I only provided an interim solution. A PR would be welcome!
For choice of Gencode set, I don't really have a great answer other than it was easier to deal with a simpler annotation set at the time.
Hi,
I was following the workflow to create the 'standard' reference library, but instead starting from an updated gencode reference GTF (human v40) to define initial 3'UTR regions. I followed the suggested workflow (very nicely detailed btw!) as described and the build & salmon index/quant steps worked flawlessly.
However, when running
qapa quant
, I came across the following error:Digging into my outputs a little further, I realised this comes from rare gene IDs in the 'Name' field of
quant.sf
files that contain a '_PAR_Y' suffix. e.g.The extra underscores in PAR_Y lead to a few extra columns being produced from the
strsplit('_')
call which causes the error.Adding a step to e.g. replace underscores with full stops fixes the bug.
Output of create_merged_data.R looks good for these events e.g. SLC25A6:
Although the gene_id suffix is lost in the output table, providing the table as input to compute_pau.R still produces sensible PAU estimates for these genes (i.e. PAU is computed separately for the non-suffixed and suffixed genes respectively). The NumEvents should maybe be adjusted for the different chromosomes but I didn't get around to trying to fix that.
Hope this is clear. Happy to share more tables, information or submit a PR if you wish :) Just a few more general comments:
Thanks,
Sam
The text was updated successfully, but these errors were encountered: