Support for telomere to telomere build - T2T - CHM13v2 #814

Open
davmlaw opened this issue May 4, 2023 · 3 comments

davmlaw commented May 4, 2023

Hamish asked whether we support this. Ensembl have now released a version of VEP that supports it:

Ensembl/ensembl-vep#1409

Working in feature/t2t_genome_build

davmlaw commented Dec 3, 2024

VEP fasta:

wget --quiet -O - https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/genome/Homo_sapiens-GCA_009914755.4-softmasked.fa.gz | gzip -d | bgzip > Homo_sapiens-GCA_009914755.4-softmasked.fa.gz
samtools faidx Homo_sapiens-GCA_009914755.4-softmasked.fa.gz

VEP cache

wget https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/indexed_vep_cache/Homo_sapiens-GCA_009914755.4-2022_10.tar.gz
tar -xzf Homo_sapiens-GCA_009914755.4-2022_10.tar.gz
# The extracted directory needs to be renamed/copied to "homo_sapiens"
mkdir -p homo_sapiens  # may already exist from the standard Ensembl cache
mv homo_sapiens_gca009914755v4/107_T2T-CHM13v2.0 homo_sapiens
rmdir homo_sapiens_gca009914755v4

Fasta (NCBI RefSeq)

wget --quiet -O - https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz | gzip -d | bgzip > GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
samtools faidx GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz

Liftover

wget https://hgdownload.soe.ucsc.edu/goldenPath/hs1/liftOver/hs1ToHg19.over.chain.gz https://hgdownload.soe.ucsc.edu/goldenPath/hs1/liftOver/hs1ToHg38.over.chain.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHs1.over.chain.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHs1.over.chain.gz
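The four chain files above all follow the same UCSC layout: `goldenPath/<source>/liftOver/<source>To<Target>.over.chain.gz`, with the first letter of the target upper-cased. A minimal sketch of building those URLs, assuming that pattern holds for the builds involved; `chain_url` is a hypothetical helper, not something in VariantGrid:

```python
UCSC_BASE = "https://hgdownload.soe.ucsc.edu/goldenPath"


def chain_url(source: str, target: str) -> str:
    """Build the liftOver chain URL for a pair of UCSC database names."""
    # UCSC upper-cases the first letter of the target name, e.g. hs1 -> Hs1
    target_cap = target[0].upper() + target[1:]
    return f"{UCSC_BASE}/{source}/liftOver/{source}To{target_cap}.over.chain.gz"


# The four directions used in this comment (hs1 is UCSC's name for T2T-CHM13v2.0)
for source, target in [("hs1", "hg19"), ("hs1", "hg38"), ("hg19", "hs1"), ("hg38", "hs1")]:
    print(chain_url(source, target))
```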

VEP annotation files

https://hgdownload.soe.ucsc.edu/gbdb/hs1/ has a lot of candidate files. See the T2T assembly in the UCSC browser for the track listings.

Clinvar

wget https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/2024_07/clinvar_20240624_GCA_009914755.4.vcf.gz
python manage.py clinvar_import /data/annotation/variantgrid_setup_data/clinvar/t2t/clinvar_20240624_GCA_009914755.4.vcf.gz

gnomAD

Downloading v4 exomes + genomes from here:

https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/2024_07/

Will need to combine them
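One way to read "combine" is to sum allele counts (AC) and allele numbers (AN) per variant across the exome and genome sets, then recompute the allele frequency. The sketch below only shows that arithmetic on plain dictionaries. The record layout and the `combine_counts` helper are assumptions for illustration; real input would come from the VCF INFO fields.

```python
from typing import Dict, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)
Counts = Tuple[int, int]             # (AC, AN)


def combine_counts(
    exomes: Dict[Variant, Counts],
    genomes: Dict[Variant, Counts],
) -> Dict[Variant, Tuple[int, int, float]]:
    """Sum AC and AN across both sets and recompute AF for every variant."""
    combined = {}
    for variant in exomes.keys() | genomes.keys():
        # A variant missing from one set contributes zero to both totals
        e_ac, e_an = exomes.get(variant, (0, 0))
        g_ac, g_an = genomes.get(variant, (0, 0))
        ac, an = e_ac + g_ac, e_an + g_an
        af = ac / an if an else 0.0
        combined[variant] = (ac, an, af)
    return combined
```

Treating a missing variant as AN = 0 ignores samples that were covered at the site but carried no alternate allele, so frequencies for variants seen in only one set may be overstated.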

Somalier

wget https://github.com/brentp/somalier/files/9954286/sites.chm13v2.T2T.vcf.gz

cdot

wget https://github.com/SACGF/cdot/releases/download/data_v0.2.27/cdot-0.2.27.ensembl.T2T-CHM13v2.0.json.gz https://github.com/SACGF/cdot/releases/download/data_v0.2.27/cdot-0.2.27-Homo_sapiens_T2T-CHM13v2.0_Ensembl_2022_06.gtf.json.gz
# All transcripts (not sure whether this has any transcripts beyond those in the release below)
python3 manage.py import_gene_annotation --genome-build=T2T-CHM13v2.0 --annotation-consortium=Ensembl --json-file cdot-0.2.27.ensembl.T2T-CHM13v2.0.json.gz
# Then the release
python3 manage.py import_gene_annotation --genome-build=T2T-CHM13v2.0 --annotation-consortium=Ensembl --release T2Tv2_Ensembl_2022_06 --json-file /data/annotation/cdot/Ensembl/T2T-CHM13v2.0/cdot-0.2.27-Homo_sapiens_T2T-CHM13v2.0_Ensembl_2022_06.gtf.json.gz

CONSERVATION

No phastCons/phyloP tracks, but there is a Cactus alignment:

https://hgdownload.soe.ucsc.edu/gbdb/hs1/hgCactus/t2tChm13.v2.0.hal

Cactus reference-free alignments of GRCh38 and T2T CHM13 v2.0, using chimp (GCF_002880755.1/panTro6) as an out-group.

The data comes in HAL, a hierarchical format for storing and analyzing multiple genome alignments. I think it can be converted to phyloP-style scores and then to bigWig (hal -> halPhyloP -> wig -> bigWig).

REPEATS

RepeatMasker - https://hgdownload.soe.ucsc.edu/gbdb/hs1/t2tRepeatMasker/

dbNSFP - N/A (checked site)
dbscSNV - N/A - but 38 is just lifted over from 37 anyway
COSMIC - N/A (checked site)
MAVE - N/A
Mastermind - ?
SpliceAI SNV/indel - ?
TOPMed - ?
UK10K - ?

Write up how to add genome build

Assembly/contig

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt # Copied into source
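For the write-up, a rough sketch of reading contig names and lengths out of an NCBI assembly report. It assumes the usual report layout: tab-separated rows, `#` comment lines, and column names on the last comment line (including `RefSeq-Accn`, `Sequence-Length` and `UCSC-style-name`). The function name is hypothetical and the sample row uses placeholder values, not real accessions.

```python
def contig_names_from_assembly_report(text):
    """Map RefSeq accession -> (UCSC-style name, length) from an assembly report."""
    header = None
    mapping = {}
    for line in text.splitlines():
        if line.startswith("#"):
            # Each comment line overwrites the header, so the last one wins
            header = line.lstrip("# ").rstrip().split("\t")
            continue
        if not line.strip() or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        mapping[row["RefSeq-Accn"]] = (row["UCSC-style-name"], int(row["Sequence-Length"]))
    return mapping


# Placeholder data in the report's layout (values are made up for illustration)
SAMPLE = (
    "# Assembly name:  EXAMPLE\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\tAssigned-Molecule-Location/Type\t"
    "GenBank-Accn\tRelationship\tRefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tGB_PLACEHOLDER.1\t=\tNC_PLACEHOLDER.1\t"
    "Primary Assembly\t1000\tchr1\n"
)
print(contig_names_from_assembly_report(SAMPLE))
```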

A test VCF sample:

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.018-20240716/CHM13v2.0_HG2-T2TQ100-V1.1.vcf.gz

davmlaw commented Dec 5, 2024

TODO

  • gnomAD v4 is downloading in screen on vg.com. Once downloaded, combine exomes and genomes, test it out, then set it as v4 in the T2T annotation VEP config
  • Test that annotation in old builds etc. still works
  • Find other data?
  • RepeatMasker bigBed?
  • Still need to remove columns not in T2T. I think this is OK for VEP imports, but not for the annotation details page:
for cvf in qs.filter(q).order_by("source_field").filter(Q(vep_custom__isnull=False) | Q(vep_plugin__isnull=False)):
    print(cvf, cvf.vep_info_field)

davmlaw commented Dec 10, 2024

The view classification page throws:

  • No cached column for genome build T2T-CHM13v2.0

https://app.rollbar.com/a/jimmy.andrews/fix/item/VariantGrid/5585??utm_source=rollbar-notification

@davmlaw davmlaw self-assigned this Dec 12, 2024