Seqnames in GRCh38 Graph (minigraph-cactus) to match gene annotation #28

Open
pclavell opened this issue May 14, 2024 · 8 comments

Comments

@pclavell

Hello, I am running vg autoindex to splice the minigraph-cactus full pangenome according to GENCODE v44 gene annotations in order to map RNA-seq reads. I have two questions:

  1. Running the following command produces the error shown below:
    vg autoindex \
      --workflow mpmap \
      --prefix data/00_autoindex/splicedpangenome \
      --gfa /gpfs/projects/bsc83/Data/assemblies/pangenome/minigraph_cactus/hprc-v1.1-mc-grch38.full.gfa \
      --tx-gff /gpfs/projects/bsc83/Data/gene_annotations/gencode/v44/modified/gencode.v44.chr_patch_hapl_scaff.annotation_chr2GRCh38#chr.gtf \
      --tmp-dir temporary \
      --threads 112 \
      --verbosity 2
    Error:
    Saving GBWT and GBWTGraph to temporary/vg-ikdYP8/dir-MgGI5j/d0cc1cf507d88bdebe898d1ba90127a241a83700.gbz
    [IndexRegistry]: Adding splice junctions to GBZ-format graph.
    ERROR: Chromosome path "chr1" not found in graph or haplotypes index (line 6).

When I first saw this, I assumed it was the usual mismatch in chromosome naming (chr1 versus 1). I looked in the minigraph-cactus reference and found SN:Z:GRCh38#chr1, so I changed the seqnames in the gene annotation from chr1 to GRCh38#chr1, but I still get the same error. Which seqnames does this pangenome reference use?

  2. GENCODE v44 is built on GRCh38.p14, so I am wondering whether it is compatible with the minigraph-cactus pangenome references you built.

Thanks

@glennhickey
Collaborator

attn: @jeizenga

@jeizenga

Are you able to share the GTF that you were using? Even the first few hundred lines would probably be sufficient.

@pclavell
Author

You can download it from this link (obtained from the gencode webpage): https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.chr_patch_hapl_scaff.annotation.gtf.gz

@ldammer

ldammer commented Jun 10, 2024

Hello,

I was wondering if you found a solution to this issue. I'm getting the same error, and I have tried multiple annotations: the GENCODE one mentioned here, as well as annotations from NCBI and UCSC.
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/

In all cases, the run crashes with the same error mentioned above.

@pclavell
Author

Hello,
No, I couldn't solve it. I am waiting for the developers' answer.

@jeizenga

Hi, apologies for the delay--my union has been on strike and I'm only just returning to work. TL;DR: you can prepend GRCh38#0# to the contig names in the GTF using sed, and it should then run through.
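
A minimal sketch of that sed step, assuming the GTF still carries the original contig names (chr1, chr2, ...) at the start of each record. The function name `add_pansn_prefix` and the file names in the usage comment are placeholders for illustration, not part of vg.

```shell
# Hypothetical helper: prepend GRCh38#0# to the contig name that starts
# every GTF record. Lines beginning with '#' (GTF header comments) and
# empty lines do not match the pattern and are passed through unchanged.
add_pansn_prefix() {
  sed 's/^[^#]/GRCh38#0#&/'
}

# Example usage (file names are placeholders):
#   zcat annotation.gtf.gz | add_pansn_prefix > annotation.pansn.gtf
```

If the GTF was already edited so that contigs read GRCh38#chr1, start again from the unmodified file; otherwise the result would be GRCh38#0#GRCh38#chr1, which is not the name described here.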

The GFA you're pointing to stores the reference genome as a particular "sample" alongside other samples that have identifiers like HG0xxxx. The combination of sample, haplotype, and contig is specified using the PanSN naming specification, which produces names that look something like this:

GRCh38#0#chr1

The first field is the sample identifier (GRCh38), the second is the haplotype (0, which is somewhat redundant for references that don't have a diplotype), and the third is the contig (chr1).
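
To see which names a given GFA would yield in this three-field form, one option is to read the path records directly. The sketch below assumes the GFA 1.1 line layout (P-lines carry the path name in column 2; W-lines carry sample, haplotype index, and sequence ID in columns 2 to 4). Whether the reference in this particular file is stored as W-lines or P-lines is not stated in the thread, so both are handled. The function name `list_gfa_path_names` is a placeholder.

```shell
# Hypothetical helper: print the distinct path names in a GFA read from
# stdin. P-line names are printed as stored; W-line fields are joined
# as sample#haplotype#contig to mirror the PanSN layout.
list_gfa_path_names() {
  awk -F'\t' '
    $1 == "P" { print $2 }
    $1 == "W" { print $2 "#" $3 "#" $4 }
  ' | LC_ALL=C sort -u
}

# Example usage (file name is a placeholder):
#   list_gfa_path_names < graph.gfa | grep '^GRCh38'
```

Comparing this list against the first column of the GTF is a quick way to spot a naming mismatch before starting a long autoindex run. On a full pangenome GFA the scan may take a while, since walk lines are very long.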

@yangyaxi4444

Hello, I was wondering if the version of annotation matters here?

@jeizenga

Different versions necessarily give different results, since they have different transcript sets. The contig naming requirements should be the same though.


5 participants