schatzlab/crossstitch

CrossStitch: Hybrid Phasing and Personal Genome Construction

CrossStitch creates personalized, reference-quality diploid genomes without de novo assembly. Rather than assembling a genome from scratch, it uses a reference genome as a baseline and updates it with the SNPs, indels, and structural variants present in your sample. For best results, the data requirements are similar to those of a de novo assembly: Illumina data for SNPs and indels, long-read data for structural variants, and phasing data such as 10X Linked Reads and/or HiC data. However, the CrossStitch process is much less demanding, produces more accurate results, and is much more predictable. The output is a phased VCF file containing all variants (SNPs, indels, and SVs), plus a phased, personalized diploid genome with 2 copies of each chromosome and the variants inserted at the correct locations.

Installation


## CrossStitch requires extractHAIRS from HapCUT2
$ git clone https://github.com/vibansal/HapCUT2
$ cd HapCUT2
$ make
$ cd ..

$ git clone --recursive https://github.com/schatzlab/crossstitch.git
$ cd crossstitch/vcf2diploid
$ make
$ cd ../Iris
$ ./build.sh

## After this, add the HapCUT2/build directory (which contains extractHAIRS) to your PATH, or edit src/crossstitch.sh with the correct path
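
The PATH step above can be sketched as follows. This is a minimal example, not part of CrossStitch: `HAPCUT2_DIR` is a hypothetical variable introduced here, and it assumes HapCUT2 was cloned and built in the current directory unless you set it otherwise.

```shell
# Assumption: HAPCUT2_DIR (hypothetical variable) points at the HapCUT2 clone;
# it defaults to ./HapCUT2 relative to the current directory.
HAPCUT2_DIR="${HAPCUT2_DIR:-$PWD/HapCUT2}"

# Prepend the build directory so the shell can resolve extractHAIRS by name.
export PATH="$HAPCUT2_DIR/build:$PATH"

# Report whether extractHAIRS is now visible on the PATH.
if command -v extractHAIRS >/dev/null 2>&1; then
    echo "extractHAIRS found at: $(command -v extractHAIRS)"
else
    echo "extractHAIRS not found; check that 'make' succeeded in $HAPCUT2_DIR" >&2
fi
```

To make the change persistent, the `export` line can go in your shell startup file with an absolute path in place of `$PWD`.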

Running CrossStitch

Currently only human genomes are supported for diploid genome construction.

$ crossstitch.sh phased_snps.vcf unphased_structural_variants.vcf long_reads.bam genome.fa outputprefix karyotype refine
 
Details:
  phased_snps.vcf:                   VCF file of phased SNP and indel variants. We recommend LongRanger (10X only) or HapCUT2 (HiC and/or 10X)
  unphased_structural_variants.vcf:  VCF file of structural variants identified using Sniffles
  long_reads.bam:                    BAM file of long reads aligned with NGMLR
  genome.fa:                         Reference genome used
  outputprefix:                      Prefix for output files
  karyotype:                         "xy" or "xx", used to ensure sex chromosomes are correctly used
  refine:                            optionally refine structural variant calls with local assembly (1=refine, 0=skip)
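
Because a run can take a long time, it may help to check the seven positional arguments before launching it. The function below is a hypothetical pre-flight check written for this README, not something shipped with CrossStitch. It only verifies what the argument list above states: four readable input files, a non-empty output prefix, a karyotype of "xy" or "xx", and a refine flag of 0 or 1.

```shell
# Hypothetical helper (not part of CrossStitch): validate the arguments
# in the same order that crossstitch.sh expects them.
check_crossstitch_args() {
    if [ "$#" -ne 7 ]; then
        echo "expected 7 arguments, got $#" >&2
        return 1
    fi
    local snps=$1 svs=$2 bam=$3 genome=$4 prefix=$5 karyotype=$6 refine=$7
    local f

    # The four input files must exist and be readable.
    for f in "$snps" "$svs" "$bam" "$genome"; do
        if [ ! -r "$f" ]; then
            echo "cannot read input file: $f" >&2
            return 1
        fi
    done

    if [ -z "$prefix" ]; then
        echo "outputprefix must not be empty" >&2
        return 1
    fi

    case "$karyotype" in
        xy|xx) ;;
        *) echo "karyotype must be \"xy\" or \"xx\", got: $karyotype" >&2; return 1 ;;
    esac

    case "$refine" in
        0|1) ;;
        *) echo "refine must be 0 or 1, got: $refine" >&2; return 1 ;;
    esac
}
```

One possible way to use it is to chain the check and the real command: `check_crossstitch_args "$@" && crossstitch.sh "$@"`.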

Running Insertion Refinement

The unphased structural variant VCF must include the names of the supporting reads for insertions (run Sniffles with -n [x], where x > 0).

To run the insertion refinement on its own, use the Iris submodule. It has dependencies on minimap2, racon, and samtools.
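
A quick way to confirm that a structural variant VCF carries supporting read names is to count the insertion records that have them. The sketch below rests on two assumptions about the VCF layout that you should verify against your own file: insertions are marked with `SVTYPE=INS` in the INFO column, and read names are stored in an INFO field named `RNAMES`. The function name is hypothetical.

```shell
# Hypothetical helper: print "<insertions> <insertions with read names>".
# Assumptions: INFO is column 8, insertions have SVTYPE=INS, and read names
# appear as a non-empty RNAMES=... entry in INFO.
count_ins_read_names() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        $8 ~ /(^|;)SVTYPE=INS(;|$)/ {
            total++
            if ($8 ~ /(^|;)RNAMES=[^;]/) named++
        }
        END { printf "%d %d\n", total, named }
    ' "$1"
}
```

If the two numbers differ, some insertions lack read names and the VCF would need to be regenerated with read name reporting enabled before refinement.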

Simulations for testing

Simple example: 5 indels, zero other SVs

$ cd crossstitch/simul/simple
$ ../sim_example.sh 1

Complex example: 20 indels, 10 inversions

$ cd crossstitch/simul/fiftyinv
$ ../sim_example.sh 1

Creating your own simulations

Copy the simul.param file from the simple example and edit it to include more or fewer SVs of each type. Then run ../sim_example.sh from the directory containing your simul.param file.

Output files

Main Files:

*.hap1.fa.gz: Haplotype 1 chromosome fasta sequences 
*.hap2.fa.gz: Haplotype 2 chromosome fasta sequences 
*.spliced.scrubbed.vcf.gz: Finalized set of phased small and structural variants 
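
As a sanity check on the haplotype FASTA files, you can list each sequence name with its length and compare the two haplotypes against the reference. This is a minimal sketch with a hypothetical function name. It assumes only that the files are gzip-compressed FASTA, and makes no assumption about how CrossStitch names the sequences.

```shell
# Hypothetical helper: print "<sequence name><TAB><length>" for each record
# in a gzip-compressed FASTA file.
fasta_lengths() {
    gzip -dc "$1" | awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush the previous record
            name = substr($1, 2)                  # name is the first word after ">"
            len = 0
            next
        }
        { len += length($0) }                     # accumulate sequence length
        END { if (name != "") print name "\t" len }
    '
}
```

For example, `fasta_lengths outputprefix.hap1.fa.gz` and `fasta_lengths outputprefix.hap2.fa.gz` can be compared side by side. Lengths are expected to differ from the reference wherever indels or SVs were inserted.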

Annotation files:

*.map:   liftover file to relate coordinates on the personalized assembly to the reference (such as GRCh38) 
*.chain: liftover file to relate coordinates on the personalized assembly to the reference (such as GRCh38) 

About

Code for phasing SVs with SNPs
